
A Graph is Worth K Words: Euclideanizing Graph using Pure Transformer

Zhangyang Gao*¹ ², Daize Dong*¹, Cheng Tan¹ ², Jun Xia¹ ², Bozhen Hu¹ ², Stan Z. Li¹

arXiv:2402.02464v3 [cs.LG] 29 May 2024

Abstract

Can we model Non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The Non-Euclidean property has posed a long-standing challenge in graph modeling. Despite recent efforts to encode graphs as Euclidean vectors with graph neural networks and graph transformers, recovering the original graph from those vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring a Graph2Seq encoder that transforms Non-Euclidean graphs into learnable Graph Words in the Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from Graph Words to ensure information equivalence. We pretrain GraphsGPT on 100M molecules and yield some interesting findings: (1) The pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on 8/9 graph classification and regression tasks. (2) The pretrained GraphGPT serves as a strong graph generator, demonstrated by its ability to perform both few-shot and conditional graph generation. (3) Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges. (4) The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation. Code is available at GitHub.

*Equal contribution. ¹Westlake University, Hangzhou, China. ²Zhejiang University, Hangzhou, China. Correspondence to: Stan Z. Li <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Graphs, inherently Non-Euclidean data, are extensively applied in scientific fields such as molecular design, social network analysis, recommendation systems, and meshed 3D surfaces (Shakibajahromi et al., 2024; Zhou et al., 2020a; Huang et al., 2022; Tan et al., 2023; Li et al., 2023a; Liu et al., 2023a; Xia et al., 2022b; Gao et al., 2022a;b; 2023; Wu et al., 2024a;b; Lin et al., 2022a). The Non-Euclidean nature of graphs has inspired sophisticated model designs, including graph neural networks (Kipf & Welling, 2016a; Veličković et al., 2017) and graph transformers (Ying et al., 2021; Min et al., 2022). These models excel in encoding graph structures through attention maps. However, the structural encoding strategies limit the use of the auto-regressive mechanism, thereby hindering the pure transformer from revolutionizing graph fields, akin to the success of Vision Transformers (ViT) (Dosovitskiy et al., 2020) in computer vision. We employ the pure transformer for graph modeling and address the following open questions: (1) How to eliminate the Non-Euclidean nature to facilitate graph representation? (2) How to generate Non-Euclidean graphs from Euclidean representations? (3) Could the combination of a graph representation and generation framework benefit from self-supervised pretraining?

We present Graph2Seq, a pure transformer encoder designed to compress the Non-Euclidean graph into a sequence of learnable tokens called Graph Words in a Euclidean form, where all nodes and edges serve as the inputs and undergo an initial transformation to form Graph Words. Different from graph transformers (Ying et al., 2021), our approach does not necessitate explicit encoding of the adjacency matrix and edge features in the attention map. Unlike TokenGT (Kim et al., 2022), we introduce a Codebook featuring learnable vectors for graph position encoding, leading to improved training stability and accelerated convergence. In addition, we employ a random shuffle of the position Codebook, implicitly augmenting different input orders for the same graph and offering each position vector the same opportunity of optimization to generalize to larger graphs.

We introduce GraphGPT, a GPT-style transformer model for graph generation. To recover the Non-Euclidean graph structure, we propose an edge-centric generation strategy that utilizes block-wise causal attention to sequentially generate the graph. Contrary to previous methods (Hu et al., 2020a; Shi et al., 2019; Peng et al., 2022) that generate nodes before predicting edges, the edge-centric technique jointly generates edges and their corresponding endpoint nodes, greatly simplifying the generative space. To align graph generation with language generation, we implement auto-regressive generation using block-wise causal attention, which enables the effective translation of Euclidean representations into Non-Euclidean graph structures.


Leveraging the Graph2Seq encoder and GraphGPT decoder, we present GraphsGPT, an integrated end-to-end framework. This framework facilitates a natural self-supervised task to optimize the representation and generation tasks, enabling the transformation between Non-Euclidean and Euclidean data structures. We pretrain GraphsGPT on 100M molecule graphs and comprehensively evaluate it from three perspectives: Encoder, Decoder, and Encoder-Decoder. The pretrained Graph2Seq encoder is a strong graph learner for property prediction, outperforming baselines with sophisticated methodologies on 8/9 molecular classification and regression tasks. The pretrained GraphGPT decoder serves as a powerful structure prior, showcasing both few-shot and conditional generation capabilities. The GraphsGPT framework seamlessly connects the Non-Euclidean graph space to the Euclidean vector space while preserving information, facilitating tasks that are known to be challenging in the original graph space, such as graph mixup. The good performance of pretrained GraphsGPT demonstrates that our edge-centric GPT-style pretraining task offers a simple yet powerful solution for graph learning. In summary, we tame the pure transformer to convert a Non-Euclidean graph into K learnable Graph Words, showing the capabilities of the Graph2Seq encoder and GraphGPT decoder pretrained through self-supervised tasks, while also paving the way for various Non-Euclidean challenges like graph manipulation and graph mixing in Euclidean latent space.

2. Related Work

Graph2Vec. Graph2Vec methods create the graph embedding by aggregating node embeddings via graph pooling (Lee et al., 2019; Ma et al., 2019; Diehl, 2019; Ying et al., 2018). The node embeddings can be learned by traditional algorithms (Ahmed et al., 2013; Grover & Leskovec, 2016; Perozzi et al., 2014; Kipf & Welling, 2016b; Chanpuriya & Musco, 2020; Xiao et al., 2020), deep-learning-based graph neural networks (GNNs) (Kipf & Welling, 2016a; Hamilton et al., 2017; Wu et al., 2019; Chiang et al., 2019; Chen et al., 2018; Xu et al., 2018), or graph transformers (Ying et al., 2021; Hu et al., 2020c; Dwivedi & Bresson, 2020; Rampášek et al., 2022; Chen et al., 2022). These methods are usually designed for specific downstream tasks and cannot be used for general pretraining.

Graph Transformers. The success of extending transformer architectures from natural language processing (NLP) to computer vision (CV) has inspired recent works to apply transformer models in the field of graph learning (Ying et al., 2021; Hu et al., 2020c; Dwivedi & Bresson, 2020; Rampášek et al., 2022; Chen et al., 2022; Wu et al., 2021b; Kreuzer et al., 2021; Min et al., 2022). To encode the graph prior, these approaches introduce structure-inspired position embeddings and attention mechanisms. For instance, Dwivedi & Bresson (2020) and Hussain et al. (2021) adopt Laplacian eigenvectors and SVD vectors of the adjacency matrix as position encoding vectors. Dwivedi & Bresson (2020); Mialon et al. (2021); Ying et al. (2021); Zhao et al. (2021) enhance the attention computation based on the adjacency matrix. Recently, Kim et al. (2022) introduced a decoupled position encoding method that empowers the pure transformer as a strong graph learner without the need for expensive eigenvector computation or modifications to the attention computation.

Graph Self-Supervised Learning. The exploration of self-supervised pretext tasks for learning expressive graph representations has garnered significant research interest (Wu et al., 2021a; Liu et al., 2022; 2021c; Xie et al., 2022). Contrastive (You et al., 2020; Zeng & Xie, 2021; Qiu et al., 2020; Zhu et al., 2020; 2021; Peng et al., 2020b; Liu et al., 2023c;b; Lin et al., 2022b; Xia et al., 2022a; Zou et al., 2022) and predictive (Peng et al., 2020a; Jin et al., 2020; Hou et al., 2022; Tian et al., 2023; Hwang et al., 2020; Wang et al., 2021) objectives have been extensively explored, leveraging strategies from the fields of NLP and CV. However, the discussion around generative pretext tasks (Hu et al., 2020a; Zhang et al., 2021) for graphs is limited, particularly due to the Non-Euclidean nature of graph data, which has led to few instances of pure transformer utilization in graph generation. This paper introduces an innovative approach by framing graph generation as analogous to language generation, thus enabling the use of a pure transformer to generate graphs as a novel self-supervised pretext task.

Motivation. The pure transformer has revolutionized the modeling of texts (Devlin et al., 2018; Brown et al., 2020; Achiam et al., 2023), images (Dosovitskiy et al., 2020; Alayrac et al., 2022; Dehghani et al., 2023; Liu et al., 2021d), and point clouds (Li et al., 2023b; Yu et al., 2022; Pang et al., 2022) in both representation and generation tasks. However, due to the Non-Euclidean nature, extending transformers to graphs typically necessitates the explicit incorporation of structural information into the attention computation. Such a constraint results in the following challenges:

1. Generation Challenge. When generating new nodes or bonds, the underlying graph structure changes, resulting in a complete update of all graph embeddings from scratch for full attention mechanisms. Moreover, an additional link predictor is required to predict potential edges from a |V| × |V| search space.

2. Non-Euclidean Challenge. Previous methods do not provide Euclidean prototypes to fully describe graphs. The inherent Non-Euclidean nature poses challenges for tasks like graph manipulation and mixing.

3. Representation Challenge. Limited by the generation challenge, traditional graph self-supervised learning methods have typically focused on reconstructing corrupted sub-features and sub-structures. Overlooking learning from the entire graph potentially limits the ability to capture the global topology.


[Figure 1]

Figure 1: The overall framework of GraphsGPT. The Graph2Seq encoder transforms the Non-Euclidean graph into Euclidean Graph Words, which are further fed into the GraphGPT decoder to auto-regressively generate the original Non-Euclidean graph. Both Graph2Seq and GraphGPT employ the pure transformer as the structure.

To tackle these challenges, we propose GraphsGPT, which uses a pure transformer to convert the Non-Euclidean graph into a sequence of Euclidean vectors (Graph2Seq) while ensuring informative equivalence (GraphGPT). For the first time, we bridge the gap between graph and sequence modeling in both representation and generation tasks.

3. Method

3.1. Overall Framework

Figure 1 outlines the comprehensive architecture of GraphsGPT, which consists of a Graph2Seq encoder and a GraphGPT decoder. Graph2Seq converts Non-Euclidean graphs into a series of learnable feature vectors, named Graph Words. Following this, GraphGPT utilizes these Graph Words to auto-regressively reconstruct the original Non-Euclidean graph. Both components incorporate the pure transformer structure and are pretrained via a GPT-style pretext task.

3.2. Graph2Seq Encoder

Flexible Token Sequence (FTSeq). Denote G = (V, E) as the input graph, where V = {v_1, ..., v_n} and E = {e_1, ..., e_n′} are sets of nodes and edges associated with features X_V ∈ R^{n×C} and X_E ∈ R^{n′×C}, respectively. With a slight abuse of notation, we use e_i^l and e_i^r to represent the left and right endpoint nodes of edge e_i. For example, we have e_1 = (e_1^l, e_1^r) = (v_1, v_2) in Figure 2. Inspired by (Kim et al., 2022), we flatten the nodes and edges in a graph into a Flexible Token Sequence (FTSeq) consisting of:

1. Graph Tokens. The stacked node and edge features are represented by X = [X_V; X_E] ∈ R^{(n+n′)×C}. We utilize a token Codebook B_t to generate node and edge features, incorporating 118+92 learnable vectors. Specifically, we consider the atom type and bond type, deferring the exploration of other properties, such as electric charge and chirality, for simplicity.

2. Graph Position Encodings (GPE). The graph structure is implicitly encoded through decoupled position encodings, utilizing a position Codebook B_p comprising m learnable embeddings {o_1, o_2, ..., o_m} ∈ R^{m×d_p}. The position encodings of node v_i and edge e_i are expressed as g_{v_i} = [o_{v_i}, o_{v_i}] and g_{e_i} = [o_{e_i^l}, o_{e_i^r}], respectively. Notably, g_{v_i}^l = g_{v_i}^r = o_{v_i}, g_{e_i}^l = o_{e_i^l}, and g_{e_i}^r = o_{e_i^r}. To learn permutation-invariant features and generalize to larger, unseen graphs, we randomly shuffle the position Codebook, giving each vector an equal optimization opportunity.

3. Segment Encodings (Seg). We introduce two learnable segment tokens, namely [node] and [edge], to designate the token types within the FTSeq.
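The three components above can be combined into per-token input vectors. Below is a minimal PyTorch sketch (not the released implementation) of one way to assemble them; the projection that folds the 2×128-dimensional GPE into the 512-dimensional model width, the Codebook sizes, and all module names are illustrative assumptions.

```python
# A minimal sketch of assembling FTSeq inputs: token embeddings from a token
# Codebook, decoupled Graph Position Encodings from a randomly shuffled position
# Codebook, and [node]/[edge] segment embeddings.
import torch
import torch.nn as nn

class FTSeqEmbedder(nn.Module):
    def __init__(self, n_atom=118, n_bond=92, d_model=512, d_pos=128, m_pos=256):
        super().__init__()
        self.token_book = nn.Embedding(n_atom + n_bond, d_model)  # token Codebook B_t
        self.pos_book = nn.Embedding(m_pos, d_pos)                # position Codebook B_p
        self.segment = nn.Embedding(2, d_model)                   # 0 = [node], 1 = [edge]
        self.proj_pos = nn.Linear(2 * d_pos, d_model)             # fold [left, right] GPE into model dim (assumption)

    def forward(self, token_ids, left_ids, right_ids, seg_ids):
        # token_ids: (L,) atom/bond type indices; left_ids/right_ids: (L,) endpoint
        # indices (a node token uses its own index twice); seg_ids: (L,) 0/1 flags.
        # Randomly permute the position Codebook so every vector gets the same
        # optimization opportunity (implicit input-order augmentation).
        perm = torch.randperm(self.pos_book.num_embeddings)
        shuffled = self.pos_book.weight[perm]                     # (m_pos, d_pos)
        gpe = torch.cat([shuffled[left_ids], shuffled[right_ids]], dim=-1)
        return self.token_book(token_ids) + self.proj_pos(gpe) + self.segment(seg_ids)

# Toy usage with hypothetical indices (atom C = 5, bond C-C = 118).
emb = FTSeqEmbedder()
tok = torch.tensor([5, 118, 5, 118, 5])
left = torch.tensor([0, 0, 1, 1, 2])
right = torch.tensor([0, 1, 1, 2, 2])
seg = torch.tensor([0, 1, 0, 1, 0])
print(emb(tok, left, right, seg).shape)   # torch.Size([5, 512])
```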


Algorithm 1 Construction of the Flexible Token Sequence
Require: Canonical SMILES CS.
Ensure: Flexible Token Sequence FTSeq.
1: Convert the canonical SMILES CS to graph G.
2: Get the first node v_1 in graph G by CS.
3: Initialize the sequence FTSeq = [v_1].
4: for e_i in DFS(G, v_1) do
5:   Update the sequence FTSeq ← [FTSeq, e_i].
6:   if e_i^r not in FTSeq then
7:     Update the sequence FTSeq ← [FTSeq, e_i^r].
8:   end if
9: end for

[Figure 2]

Figure 2: Graph to Flexible Sequence. The example molecule is flattened into the token sequence [C, C-C, C, C-C, C, C=O, O, C-C] with graph position encodings (1,1), (1,2), (2,2), (2,3), (3,3), (3,4), (4,4), (1,3) and alternating node/edge segment labels.

As depicted in Figure 2, we utilize the Depth-First Search (DFS) algorithm to convert a graph into a flexible token sequence, denoted as FTSeq = [v_1, e_1, v_2, e_2, v_3, e_3, v_4, e_4], where the starting atom matches that in the canonical SMILES. Algorithm 1 provides a detailed explanation of our approach. It is crucial to emphasize that the resulting FTSeq remains Non-Euclidean data, as the number of nodes and edges may vary across different graphs.

Euclidean Graph Words. Is there a Euclidean representation that can completely describe the Non-Euclidean graph? Given the FTSeq and k graph prompts [[GP]_1, [GP]_2, ..., [GP]_k], we use a pure transformer to learn a set of Graph Words W = [w_1, w_2, ..., w_k]:

W = Graph2Seq([[GP]_1, [GP]_2, ..., [GP]_k, FTSeq]).   (1)

The token [GP]_k is the sum of a learnable [GP] token and the k-th position encoding. The learned Graph Words W are ordered and of fixed length, analogous to a novel graph language created in the latent Euclidean space.

Graph Vocabulary. In the context of a molecular system, the complete graph vocabulary for molecules encompasses:
1. The Graph Word prompts [GP];
2. Special tokens, including the begin-of-sequence token [BOS], the end-of-sequence token [EOS], and the padding token [PAD];
3. The dictionary set of atom tokens D_v with a size of |D_v| = 118, where the order of atoms is arranged by their atomic numbers, e.g., D_6 is the atom C;
4. The dictionary set of bond tokens D_e with a size of |D_e| = 92, considering the endpoint atom types, e.g., C-C and C-O are different types of bonds even though they are both single bonds.

3.3. GraphGPT Decoder

How do we ensure that the learned Graph Words are information-equivalent to the original Non-Euclidean graph? Previous graph self-supervised learning methods focused on sub-graph generation and multi-view contrasting, which suffer potential information loss due to insufficient capture of the global graph topology. In comparison, we adopt a GPT-style decoder to auto-regressively generate the whole graph from the learned Graph Words in an edge-centric manner.

GraphGPT Formulation. Given the learned Graph Words W and the flexible token sequence FTSeq, the complete data sequence is [W, [BOS], FTSeq] = [w_1, w_2, ..., w_k, [BOS], v_1, e_1, v_2, ..., e_i]. We define FTSeq_{1:i} as the sub-sequence comprising edges with connected nodes up to e_i:

FTSeq_{1:i} = [v_1, e_1, ..., e_i, e_i^r] if e_i^r is a new node, and [v_1, e_1, ..., e_i] otherwise.   (2)

From an edge-centric perspective, we assert that e_i^r belongs to e_i; if e_i^r is a new node, it is put after e_i. Employing GraphGPT, we auto-regressively generate the complete FTSeq conditioned on W:

FTSeq_{1:i+1} ← GraphGPT([W, [BOS], FTSeq_{1:i}]),   (3)

where the output FTSeq_{1:i+1} is predicted at the position of the input FTSeq_{1:i}.

Edge-Centric Graph Generation. Nodes and edges are the basic components of a graph. Traditional node-centric graph generation methods divide the problem into two parts: (1) Node Generation; (2) Link Prediction.

We argue that node-centric approaches lead to imbalanced difficulties in generating new nodes and edges. For molecular generation, let |D_v| and |D_e| denote the number of node and edge types, respectively. Also, let n and n′ represent the number of nodes and edges. The step-wise classification complexities for predicting a new node and a new edge are O(|D_v|) and O(n × |D_e|), respectively. Notably, O(n × |D_e|) ≫ O(|D_v|), indicating a pronounced imbalance in the difficulties of generating nodes and edges. Considering that O(|D_v|) and O(|D_e|) are constants, the overall complexity of node-centric graph generation is O(n + n²).

These approaches ignore the basic truism that naturally occurring and chemically valid bonds are sparse: there are only 92 different bonds (considering the endpoints) among 870M molecules in the ZINC database (Irwin & Shoichet, 2005). Given such an observation, we propose an edge-centric generation strategy that decouples graph generation into: (1) Edge Generation; (2) Left Node Attachment; (3) Right Node Placement.
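The construction in Algorithm 1 amounts to a depth-first traversal that emits each edge followed by its right endpoint whenever that endpoint is new. A self-contained Python sketch is given below; the molecule is passed as a plain edge list rather than an RDKit object, and the edge/endpoint ordering inside the DFS is an assumption.

```python
# A sketch of Algorithm 1 (FTSeq construction) on a plain adjacency structure.
def build_ftseq(num_nodes, edges, start=0):
    """edges: list of (u, v, bond_type); returns FTSeq as a list of tokens."""
    adj = {i: [] for i in range(num_nodes)}
    for u, v, t in edges:
        adj[u].append((v, t))
        adj[v].append((u, t))

    ftseq = [("node", start)]                 # FTSeq = [v1]
    seen_nodes, seen_edges = {start}, set()
    stack = [start]
    while stack:                              # iterative DFS over edges
        u = stack[-1]
        for v, t in adj[u]:
            key = (min(u, v), max(u, v))
            if key in seen_edges:
                continue
            seen_edges.add(key)
            ftseq.append(("edge", u, v, t))   # FTSeq <- [FTSeq, e_i]
            if v not in seen_nodes:           # right endpoint is a new node
                ftseq.append(("node", v))     # FTSeq <- [FTSeq, e_i^r]
                seen_nodes.add(v)
                stack.append(v)
            break
        else:
            stack.pop()
    return ftseq

# Toy usage: the 4-atom example of Figure 2 (a carbon ring with a C=O bond).
edges = [(0, 1, "C-C"), (1, 2, "C-C"), (2, 3, "C=O"), (0, 2, "C-C")]
print(build_ftseq(4, edges))
```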


We provide a brief illustration of the three steps in Figure 3. The step-wise classification complexity of generating an edge is O(|D_e|). Once the edge is obtained, the model automatically infers the left node attachment and right node placement, relieving the generation from the additional burden of generating atom types and edge connections and resulting in a reduced complexity of O(1). With edge-centric generation, we balance the classification complexities of predicting nodes and edges as constants. Notably, the overall generation complexity is reduced to O(n + n′).

[Figure 3]

Figure 3: Overview of edge-centric graph generation (Init → Next Edge → Attach → Place), where the right endpoint is either a new atom (Case 1) or a historical atom (Case 2).

Next, we introduce the edge-centric generation in detail.

Step 0: First Node Initialization. The first node token of FTSeq is generated by:

h_{v_1} ← GraphGPT([W, [BOS]]),
p_{v_1} = Pred_v(h_{v_1}),
v_1 = arg max p_{v_1}   (Node Type),
g_{v_1} = [o_1, o_1]   (GPE).   (4)

Here, Pred_v(·) denotes a linear layer employed for the initial node generation, producing a predictive probability vector p_{v_1} ∈ R^{|D_v|}. The output v_1 corresponds to the predicted node type, and o_1 represents the node position encoding retrieved from the position Codebook B′_p of the decoder, where we should explicitly note that the encoder Codebook B_p and the decoder Codebook B′_p are not shared.

Step 1: Next Edge Generation. The edge-centric graph generation method creates the next edge by:

h_{e_{i+1}} ← GraphGPT([W, [BOS], FTSeq_{1:i}]),
p_{e_{i+1}} = Pred_e(h_{e_{i+1}}),
e_{i+1} = arg max p_{e_{i+1}}   (Edge Type),   (5)

where Pred_e is a linear layer for the next edge prediction, and p_{e_{i+1}} ∈ R^{|D_e|+1} is the predictive probability. e_{i+1} belongs to the set D_e ∪ {[EOS]}, and the generation process stops if e_{i+1} = [EOS]. Note that the edge position encoding [o_{e_{i+1}^l}, o_{e_{i+1}^r}] remains undetermined. This information affects the connection of the generated edge to the existing graph, as well as the determination of new atoms, i.e., left atom attachment and right atom placement.

Training Token Generation. The first node and next edge prediction tasks are optimized by the cross-entropy loss:

L_token = − Σ_i y_i · log p_i.   (6)

Step 2: Left Node Attachment. For the newly predicted edge e_{i+1}, we further determine how it connects to existing nodes. According to the principles of FTSeq construction, at least one endpoint of e_{i+1} must connect to existing atoms, namely the left atom e_{i+1}^l. Given the set of previously generated atoms {v_1, v_2, ..., v_j} and their corresponding graph position encodings O_j = [o_{v_1}, o_{v_2}, ..., o_{v_j}] ∈ R^{j×C} in B′_p, we predict the position encoding of the left node using a linear layer PredPos_l(·):

ĝ_{e_{i+1}^l} = PredPos_l(h_{e_{i+1}}) ∈ R^{1×C}.   (7)

We compute the cosine similarity between ĝ_{e_{i+1}^l} and O_j by c_l = ĝ_{e_{i+1}^l} O_j^T ∈ R^j. The index of the existing atom that e_{i+1}^l attaches to is u_l = arg max c_l. This process implicitly infers edge connections by querying over existing atoms, instead of generating all potential edges from scratch. We update the graph position encoding of the left node as:

g_{e_{i+1}^l} = o_{v_{u_l}}   (Left Node GPE).   (8)

Step 3: Right Node Placement. As for the right node e_{i+1}^r, we consider two cases: (1) it connects to one of the existing atoms; (2) it is a new atom. Similar to Step 2, we use a linear layer PredPos_r(·) to predict the position encoding of the right node:

ĝ_{e_{i+1}^r} = PredPos_r(h_{e_{i+1}}) ∈ R^{1×C}.   (9)

We get the cosine similarity score c_r = ĝ_{e_{i+1}^r} O_j^T and the index of the node with the highest similarity u_r = arg max c_r. Given a predefined threshold ϵ, if max c_r > ϵ, we consider e_{i+1} to be connected to v_{u_r} and update:

g_{e_{i+1}^r} = o_{v_{u_r}}   (Right Node GPE, Case 1);   (10)

otherwise, e_{i+1}^r is a new atom v_{j+1}, and we set:

g_{e_{i+1}^r} = o_{j+1}   (Right Node GPE, Case 2).   (11)

Finally, we update the FTSeq by:

FTSeq ← [FTSeq, e_{i+1}]   (Case 1);   FTSeq ← [FTSeq, e_{i+1}, v_{j+1}]   (Case 2).   (12)
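A condensed sketch of one decoding iteration (Steps 1–3) is shown below. The transformer forward pass is stubbed with a given hidden state, and `pred_e`, `pred_pos_l`, `pred_pos_r` stand in for the linear heads of Eqs. (5), (7) and (9); apart from the dimensions and the 0.5 threshold taken from the text, everything here is illustrative.

```python
# One edge-centric decoding step with randomly initialized heads (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_pos, n_bond_types, eps = 512, 128, 92, 0.5
pred_e = nn.Linear(d_model, n_bond_types + 1)   # edge type + [EOS]
pred_pos_l = nn.Linear(d_model, d_pos)          # PredPos_l
pred_pos_r = nn.Linear(d_model, d_pos)          # PredPos_r

def decode_step(h, existing_pos, next_free_pos):
    """h: (d_model,) hidden state at the last position; existing_pos: (j, d_pos)
    position encodings of atoms generated so far (rows of the decoder Codebook)."""
    edge_type = pred_e(h).argmax().item()        # Step 1: next edge (Eq. 5)
    if edge_type == n_bond_types:                # [EOS] -> stop generation
        return None

    # Step 2: left node attachment -- query existing atoms by cosine similarity (Eqs. 7-8)
    g_l = pred_pos_l(h)
    c_l = F.cosine_similarity(g_l.unsqueeze(0), existing_pos, dim=-1)
    left = c_l.argmax().item()

    # Step 3: right node placement -- reuse an existing atom or add a new one (Eqs. 9-11)
    g_r = pred_pos_r(h)
    c_r = F.cosine_similarity(g_r.unsqueeze(0), existing_pos, dim=-1)
    if c_r.max() > eps:
        right, new_atom = c_r.argmax().item(), False
    else:
        right, new_atom = next_free_pos, True    # assign the next position vector
    return edge_type, left, right, new_atom

# Toy usage with a random state and three already-generated atoms.
h = torch.randn(d_model)
existing = torch.randn(3, d_pos)
print(decode_step(h, existing, next_free_pos=3))
```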


By default, we set ϵ = 0.5.

Training Node Attachment & Placement. We adopt a contrastive objective to optimize the left node attachment and right node placement problems. Taking left node attachment as an example, given the ground truth t, i.e., the index of the attached atom in the original graph, the positive score is s⁺ = ĝ_{e_{i+1}^l} o_{v_t}^T, while the negative scores are s⁻ = |vec(O O^T)| ∈ R^{|B′_p|×(|B′_p|−1)}, where vec(·) is a flatten operation that ignores the diagonal elements. The final contrastive loss is:

L_attach = (1 − s⁺) + 1/(|B′_p| × (|B′_p| − 1)) Σ s⁻.   (13)

Block-Wise Causal Attention. In our method, node generation is closely entangled with edge generation. Specifically, on its initial occurrence, each node is connected to an edge, creating what we term a block. From the block view, we employ a causal mask for auto-regressive generation. However, within each block, we utilize full attention. We show the block-wise causal attention in Figure 4.

[Figure 4]

Figure 4: Block-wise causal attention with grey cells indicating masked positions. Graph Words contribute to the generation through full attention, serving as prefix prompts.

4. Experiments

4.1. Experiment Settings

We conduct extensive experiments to assess GraphsGPT, delving into the following questions:

• Representation (Q1): Can Graph2Seq effectively learn expressive graph representations through pretraining?

• Generation (Q2): Could pretrained GraphGPT serve as a strong structural prior model for graph generation?

• Euclidean Graph Words (Q3): What opportunities do the Euclidean Graph Words offer that were previously considered challenging?

4.2. Datasets

ZINC (Pretraining). To pretrain GraphsGPT, we select the ZINC database (Irwin & Shoichet, 2005) as our pretraining dataset, which contains a total of 870,370,225 (870M) molecules. We randomly shuffle and partition the dataset into training (99.7%), validation (0.2%), and test (0.1%) sets. The model does not traverse all the data during pretraining, i.e., a total of about 100M molecules are used.

MoleculeNet (Representation). MoleculeNet (Wu et al., 2018) is a widely-used benchmark for molecular property prediction and drug discovery. It offers a diverse collection of property datasets ranging from quantum mechanics and physical chemistry to biophysics and physiology. Both classification and regression tasks are considered. For rigorous evaluation, we employ standard scaffold splitting, as opposed to random scaffold splitting, for dataset partitioning.

MOSES & ZINC-C (Generation). For few-shot generation, we evaluate GraphsGPT on the MOSES (Polykovskiy et al., 2020) dataset, which is designed for benchmarking generative models and provides a standardized set of molecules in SMILES format. Following MOSES, we compute molecular properties (LogP, SA, QED) and scaffolds for molecules collected from ZINC, obtaining ZINC-C.

4.3. Pretraining

Model Configurations. We adopt the transformer as our model structure. Both the Graph2Seq encoder and the GraphGPT decoder consist of 8 transformer blocks with 8 attention heads. For all layers, we use Swish (Ramachandran et al., 2017) as the activation function and RMSNorm (Zhang & Sennrich, 2019) as the normalization function. The hidden size is set to 512, and the length of the Graph Position Encoding (GPE) is 128. The total number of parameters of the model is 50M. Denoting K as the number of Graph Words, multiple versions of GraphsGPT, referred to as GraphsGPT-KW, were pretrained. We mainly use GraphsGPT-1W, while we find that GraphsGPT-8W has better encoding-decoding consistency (Section 6, Q2).

Training Details. The GraphsGPT model undergoes training for 100K steps with a global batch size of 1024 on 8 NVIDIA A100s, utilizing the AdamW optimizer with 0.1 weight decay, where β1 = 0.9 and β2 = 0.95. The maximum learning rate is 1e−4 with 5K warmup steps, and the final learning rate decays to 1e−5 with cosine scheduling.
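The block-wise causal mask of Figure 4 can be built directly from per-token block indices. The sketch below assumes each FTSeq token carries the index of the edge block it belongs to, with the K Graph Words and [BOS] treated as a fully visible prefix; this is an illustration, not the released implementation.

```python
# Build a block-wise causal attention mask (True = position may be attended to).
import torch

def block_causal_mask(block_ids: torch.Tensor, num_prefix: int) -> torch.Tensor:
    """block_ids: (T,) block index per FTSeq token (0, 0, 1, 1, ...). The K Graph
    Words and [BOS] occupy num_prefix positions that every token may attend to."""
    ids = torch.cat([torch.zeros(num_prefix, dtype=torch.long), block_ids + 1])
    # Token i attends to token j iff j's block does not come after i's block:
    # equal ids give full attention inside a block, smaller ids give causal access.
    return ids.unsqueeze(1) >= ids.unsqueeze(0)   # (L, L) boolean mask

# Toy usage: 1 Graph Word + [BOS] prefix, then the blocks of Figure 4
# (node C | edge C-C, node C | edge C-C, node C | edge C=O, node O | edge C-C).
blocks = torch.tensor([0, 1, 1, 2, 2, 3, 3, 4])
print(block_causal_mask(blocks, num_prefix=2).int())
```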


Table 1: Results of molecular property prediction. We report the mean (standard deviation) metrics of 10 runs with standard scaffold splitting (not random scaffold splitting). Rows with fewer than nine entries omit datasets not reported by the corresponding method; their values are listed in the source order.

Datasets (left to right): Tox21, ToxCast, Sider, HIV, BBBP, Bace (ROC-AUC ↑); ESOL, FreeSolv, Lipo (RMSD ↓).
# Molecules: 7,831 | 8,575 | 1,427 | 41,127 | 2,039 | 1,513 | 1,128 | 642 | 4,200
# Tasks: 12 | 617 | 27 | 1 | 1 | 1 | 1 | 1 | 1

No pretrain:
GIN: 74.6 (0.4) | 61.7 (0.5) | 58.2 (1.7) | 75.5 (0.8) | 65.7 (3.3) | 72.4 (3.8) | 1.050 (0.008) | 2.082 (0.082) | 0.683 (0.016)
Graph2Seq-1W: 74.0 (0.4) | 62.6 (0.3) | 66.6 (1.1) | 73.6 (3.4) | 68.3 (1.4) | 77.3 (1.2) | 0.953 (0.025) | 1.936 (0.246) | 0.907 (0.021)
Relative gain to GIN: -0.8% | +1.4% | +12.6% | -2.6% | +3.8% | +6.3% | +10.2% | +7.5% | -24.7%

Pretrain (baselines):
InfoGraph (Sun et al., 2019): 73.3 (0.6) | 61.8 (0.4) | 58.7 (0.6) | 75.4 (4.3) | 68.7 (0.6) | 74.3 (2.6)
GPT-GNN (Hu et al., 2020b): 74.9 (0.3) | 62.5 (0.4) | 58.1 (0.3) | 58.3 (5.2) | 64.5 (1.4) | 77.9 (3.2)
EdgePred (Hamilton et al., 2017): 76.0 (0.6) | 64.1 (0.6) | 60.4 (0.7) | 64.1 (3.7) | 67.3 (2.4) | 77.3 (3.5)
ContextPred (Hu et al., 2019): 73.6 (0.3) | 62.6 (0.6) | 59.7 (1.8) | 74.0 (3.4) | 70.6 (1.5) | 78.8 (1.2)
GraphLoG (Xu et al., 2021): 75.0 (0.6) | 63.4 (0.6) | 59.6 (1.9) | 75.7 (2.4) | 68.7 (1.6) | 78.6 (1.0)
G-Contextual (Rong et al., 2020): 75.0 (0.6) | 62.8 (0.7) | 58.7 (1.0) | 60.6 (5.2) | 69.9 (2.1) | 79.3 (1.1)
G-Motif (Rong et al., 2020): 73.6 (0.7) | 62.3 (0.6) | 61.0 (1.5) | 77.7 (2.7) | 66.9 (3.1) | 73.0 (3.3)
AD-GCL (Suresh et al., 2021): 74.9 (0.4) | 63.4 (0.7) | 61.5 (0.9) | 77.2 (2.7) | 70.7 (0.3) | 76.6 (1.5)
JOAO (You et al., 2021): 74.8 (0.6) | 62.8 (0.7) | 60.4 (1.5) | 66.6 (3.1) | 66.4 (1.0) | 73.2 (1.6) | 1.120 (0.003) | 0.708 (0.004)
SimGRACE (Xia et al., 2022a): 74.4 (0.3) | 62.6 (0.7) | 60.2 (0.9) | 75.5 (2.0) | 71.2 (1.1) | 74.9 (2.0)
GraphCL (You et al., 2020): 75.1 (0.7) | 63.0 (0.4) | 59.8 (1.3) | 77.5 (3.8) | 67.8 (2.4) | 74.6 (2.1) | 0.947 (0.038) | 2.233 (0.261) | 0.739 (0.009)
GraphMAE (Hou et al., 2022): 75.2 (0.9) | 63.6 (0.3) | 60.5 (1.2) | 76.5 (3.0) | 71.2 (1.0) | 78.2 (1.5)
3D InfoMax (Stärk et al., 2022): 74.5 (0.7) | 63.5 (0.8) | 56.8 (2.1) | 62.7 (3.3) | 69.1 (1.2) | 78.6 (1.9) | 0.894 (0.028) | 2.337 (0.227) | 0.695 (0.012)
GraphMVP (Liu et al., 2021b): 74.9 (0.8) | 63.1 (0.2) | 60.2 (1.1) | 79.1 (2.8) | 70.8 (0.5) | 79.3 (1.5) | 1.029 (0.033) | 0.681 (0.010)
MGSSL (Zhang et al., 2021): 75.2 (0.6) | 63.3 (0.5) | 61.6 (1.0) | 77.1 (4.5) | 68.8 (0.6) | 78.8 (0.9)
AttrMask (Hu et al., 2019): 75.1 (0.9) | 63.3 (0.6) | 60.5 (0.9) | 73.5 (4.3) | 65.2 (1.4) | 77.8 (1.8) | 1.100 (0.006) | 2.764 (0.002) | 0.739 (0.003)
MolCLR (Wang et al., 2022): 75.0 (0.2) | 58.9 (1.4) | 78.1 (0.5) | 72.2 (2.1) | 82.4 (0.9) | 1.271 (0.040) | 2.594 (0.249) | 0.691 (0.004)
Graphformer (Rong et al., 2020): 74.3 (0.1) | 65.4 (0.4) | 64.8 (0.6) | 62.5 (0.9) | 70.0 (0.1) | 82.6 (0.7) | 0.983 (0.090) | 2.176 (0.052) | 0.817 (0.008)
Mole-BERT (Xia et al., 2023): 76.8 (0.5) | 64.3 (0.2) | 62.8 (1.1) | 78.9 (3.0) | 71.9 (1.6) | 80.8 (1.4) | 1.015 (0.030) | 0.676 (0.017)
Relative gain to GIN: +2.9% | +6.0% | +11.3% | +4.8% | +9.9% | +14.1% | +14.9% | -4.5% | +1.0%

Pretrain (ours):
Graph2Seq-1W: 76.9 (0.3) | 65.4 (0.5) | 68.2 (0.9) | 79.4 (3.9) | 72.8 (1.5) | 83.4 (1.0) | 0.860 (0.024) | 1.797 (0.237) | 0.716 (0.019)
Relative gain to GIN: +3.1% | +6.0% | +17.2% | +5.2% | +10.8% | +15.2% | +18.1% | +13.7% | -4.8%
Relative gain to non-pretrained Graph2Seq-1W: +3.9% | +4.5% | +2.4% | +7.9% | +6.6% | +7.9% | +9.8% | +7.2% | +21.1%

4.4. Representation

Can Graph2Seq effectively learn expressive graph representations through pretraining?

Setting & Baselines. We finetune the pretrained Graph2Seq-1W on the MoleculeNet datasets. The learned Graph Words are input into a linear layer for graph classification or regression. We adhere to standard scaffold splitting (not random scaffold splitting) for rigorous and meaningful comparison. We do not incorporate the 3D structure of molecules for modeling. Recent strong molecular graph pretraining baselines are considered for comparison.

We show property prediction results in Table 1, finding that:

Pure Transformer is Competitive with GNNs. Without pretraining, Graph2Seq-1W demonstrates comparable performance to GNNs. Specifically, in 4 out of 9 cases, Graph2Seq-1W outperforms GIN with gains exceeding 5%, and in another 4 out of 9 cases it achieves similar performance with an absolute relative gain of less than 5%. In addition, the pure transformer runs much faster than GNNs, i.e., we finish the pretraining of GraphsGPT within 6 hours using 8 A100s.

GPT-Style Pretraining is All You Need. Pretrained Graph2Seq demonstrates a non-trivial improvement on 8 out of 9 datasets when compared to baselines. These results are achieved without employing complex pretraining strategies such as multi-pretext combination and hard-negative sampling, highlighting that GPT-style pretraining alone is sufficient for achieving SOTA performance and providing a simple yet effective solution for graph SSL.

Graph2Seq Benefits More from GPT-Style Pretraining. This non-trivial improvement was not observed for the earlier GPT-GNN (Hu et al., 2020b), which adopts a node-centric generation strategy and GNN architectures. This suggests that the transformer model is more suitable for scaling to large datasets. In addition, previously pretrained transformers without GPT-style pretraining (Rong et al., 2020) perform worse than Graph2Seq. This underscores that generating the entire graph enhances the learning of global topology and results in more expressive representations.
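The finetuning head described above (Graph Words fed into a single linear layer) is small enough to write out; the mean pooling over K Graph Words for K > 1 is an assumption, and the task count below is only an example (12 tasks for Tox21).

```python
# A minimal sketch of a linear probe on Graph Words for property prediction.
import torch
import torch.nn as nn

class GraphWordProbe(nn.Module):
    def __init__(self, d_model=512, num_tasks=12):        # e.g. 12 tasks for Tox21
        super().__init__()
        self.head = nn.Linear(d_model, num_tasks)

    def forward(self, graph_words):                        # (batch, K, d_model)
        return self.head(graph_words.mean(dim=1))          # (batch, num_tasks)

probe = GraphWordProbe()
fake_words = torch.randn(4, 1, 512)                        # batch of 4 molecules, K = 1
logits = probe(fake_words)
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (4, 12)).float())
print(logits.shape, loss.item())
```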


4.5. Generation

Could pretrained GraphGPT serve as a strong structural prior model for graph generation?

GraphGPT Generates Novel Molecules with High Validity. We assess the pretrained GraphGPT-1W on the MOSES dataset through few-shot generation without finetuning. By extracting Graph Word embeddings {h_i}_{i=1}^M from M training molecules, we construct a Gaussian mixture distribution p(h, s) = Σ_{i=1}^M N(h_i, sI), where s is the variance. We sample M molecules from p(h, s) and report the validity, uniqueness, novelty, and internal diversity (IntDiv) in Table 2. We observe that GraphGPT generates novel molecules with high validity. Without any finetuning, GraphGPT outperforms MolGPT on validity, uniqueness, novelty, and diversity. Definitions of the metrics can be found in Appendix B.

Table 2: Few-shot generation results of GraphGPT-1W. We use M = 100K shots and sample the same number of Graph Word embeddings under different variance s.

| Setting | Model | Validity ↑ | Unique ↑ | Novelty ↑ | IntDiv1 ↑ | IntDiv2 ↑ |
| Unconditional | HMM | 0.076 | 0.567 | 0.999 | 0.847 | 0.810 |
| Unconditional | NGram | 0.238 | 0.922 | 0.969 | 0.874 | 0.864 |
| Unconditional | Combinatorial | 1.0 | 0.991 | 0.988 | 0.873 | 0.867 |
| Unconditional | CharRNN | 0.975 | 0.999 | 0.842 | 0.856 | 0.850 |
| Unconditional | VAE | 0.977 | 0.998 | 0.695 | 0.856 | 0.850 |
| Unconditional | AAE | 0.937 | 0.997 | 0.793 | 0.856 | 0.850 |
| Unconditional | LatentGAN | 0.897 | 0.997 | 0.949 | 0.857 | 0.850 |
| Unconditional | JT-VAE | 1.0 | 0.999 | 0.914 | 0.855 | 0.849 |
| Unconditional | MolGPT | 0.994 | 1.0 | 0.797 | 0.857 | 0.851 |
| Few-shot | GraphGPT-1W (s=0.25) | 0.995 | 0.995 | 0.255 | 0.854 | 0.850 |
| Few-shot | GraphGPT-1W (s=0.5) | 0.993 | 0.996 | 0.334 | 0.856 | 0.848 |
| Few-shot | GraphGPT-1W (s=1.0) | 0.978 | 0.997 | 0.871 | 0.860 | 0.857 |
| Few-shot | GraphGPT-1W (s=2.0) | 0.972 | 1.0 | 1.0 | 0.850 | 0.847 |

GraphGPT-C is a Controllable Molecule Generator. Following (Bagal et al., 2021), we finetune GraphsGPT-1W on 100M molecules from ZINC-C with properties and scaffolds as prefix inputs, obtaining GraphsGPT-1W-C. We assess whether the model can generate molecules satisfying specified properties. We present summarized results in Figure 5 and Table 3, and provide the full results in the appendix due to the space limit. The evaluation is conducted using the scaffold "c1ccccc1", demonstrating that GraphGPT can effectively control the properties of generated molecules. Table 3 further confirms that unsupervised pretraining enhances the controllability and validity of GraphGPT. More details can be found in Appendix B.2.

[Figure 5]

Figure 5: Property distribution of generated molecules under different conditions using GraphsGPT-1W-C: (a) QED, (b) logP. "Dataset" denotes the distribution of the training dataset (ZINC-C).

Table 3: Comparison with MolGPT on different properties. "MAD" denotes the Mean Absolute Deviation of generated molecule properties compared to the oracle value. "SD" denotes the Standard Deviation of the generated property.

| Model | Pretrain | Metric | QED=0.5 | SA=0.7 | logP=0.0 | Avg. |
| MolGPT | ✗ | MAD ↓ | 0.081 | 0.024 | 0.304 | 0.136 |
| MolGPT | ✗ | SD ↓ | 0.065 | 0.022 | 0.295 | 0.127 |
| MolGPT | ✗ | Validity ↑ | 0.985 | 0.975 | 0.982 | 0.981 |
| GraphGPT-1W-C | ✗ | MAD ↓ | 0.041 | 0.012 | 0.103 | 0.052 |
| GraphGPT-1W-C | ✗ | SD ↓ | 0.079 | 0.055 | 0.460 | 0.198 |
| GraphGPT-1W-C | ✗ | Validity ↑ | 0.988 | 0.995 | 0.980 | 0.988 |
| GraphGPT-1W-C | ✓ | MAD ↓ | 0.032 | 0.002 | 0.017 | 0.017 |
| GraphGPT-1W-C | ✓ | SD ↓ | 0.080 | 0.042 | 0.404 | 0.175 |
| GraphGPT-1W-C | ✓ | Validity ↑ | 0.996 | 0.995 | 0.994 | 0.995 |

4.6. Euclidean Graph Words

What opportunities do the Euclidean Graph Words offer that were previously considered challenging?

For graph classification, let the i-th sample be denoted as (G_i, y_i), where G_i and y_i represent the graph and the one-hot label, respectively. When considering paired graphs (G_i, y_i) and (G_j, y_j), and employing a mixing ratio λ sampled from the Beta(α, α) distribution, the mixed label is defined as y_mix = λ y_i + (1 − λ) y_j. However, due to the irregular, unaligned, and Non-Euclidean nature of graph data, applying mixup to obtain G_mix is nontrivial. Recent efforts (Zhou et al., 2020b; Park et al., 2022; Wu et al., 2022; Zhang et al., 2023; Guo & Mao, 2023) have attempted to address this challenge by introducing complex hand-crafted rules. Additionally, G-mixup (Han et al., 2022) leverages estimated graphons for generating mixed graphs. To the best of our knowledge, there is currently no learnable model for mixing in Euclidean space while generating new graphs.

Table 4: Graph mixup results. We compare Graph2Seq with G-mixup on multiple tasks from MoleculeNet.

| Method | Mixup | HIV ↑ | BBBP ↑ | Bace ↑ | Tox21 ↑ | ToxCast ↑ | Sider ↑ |
| G-Mixup | ✗ | 77.1 | 68.4 | 75.9 | – | – | – |
| G-Mixup | ✓ | 77.1 | 70.2 | 77.8 | – | – | – |
| G-Mixup | gain | +0.0 | +1.8 | +1.9 | – | – | – |
| Ours | ✗ | 79.4 | 72.8 | 83.4 | 76.9 | 65.4 | 68.2 |
| Ours | ✓ | 79.8 | 73.4 | 85.4 | 77.2 | 65.5 | 68.9 |
| Ours | gain | +0.4 | +0.6 | +2.0 | +0.3 | +0.1 | +0.7 |
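Both few-shot generation (Section 4.5) and latent mixup (Section 4.6) reduce to simple tensor operations in the Graph Word space. The sketch below shows the two operations on random stand-ins for encoder outputs; `graphgpt_decode` in the final comment is a placeholder for the pretrained decoder, and treating s as a variance is an assumption.

```python
# Euclidean-space operations on Graph Words: mixture sampling and latent mixup.
import torch

def sample_few_shot(word_bank: torch.Tensor, num_samples: int, s: float) -> torch.Tensor:
    """word_bank: (M, d) Graph Word embeddings of M training molecules.
    Draw from the mixture sum_i N(h_i, s*I): pick a center, then add Gaussian noise."""
    idx = torch.randint(0, word_bank.size(0), (num_samples,))
    centers = word_bank[idx]
    return centers + s ** 0.5 * torch.randn_like(centers)

def latent_mixup(w_i: torch.Tensor, w_j: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Mix two molecules' Graph Words with lambda ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * w_i + (1.0 - lam) * w_j

# Toy usage with random stand-ins for Graph2Seq outputs.
bank = torch.randn(100, 512)
new_words = sample_few_shot(bank, num_samples=5, s=0.5)
mixed = latent_mixup(bank[0], bank[1])
print(new_words.shape, mixed.shape)
# The sampled/mixed vectors would then be decoded, e.g. graphgpt_decode(new_words).
```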


GraphsGPT is a Competitive Graph Mixer. We mix the learned Graph Words encoded by Graph2Seq-1W, then generate the mixed graph using GraphGPT-1W. Formally, the Graph Words of G_i and G_j are W_i = Graph2Seq(G_i) and W_j = Graph2Seq(G_j), and the mixed graph is G_mix = GraphGPT(λW_i + (1 − λ)W_j). We conduct experiments on MoleculeNet and show the results in Table 4. We observe that the straightforward latent mixup outperforms the elaborately designed G-mixup proposed in the ICML'22 outstanding paper (Han et al., 2022).

Due to the page limit, more results are moved to the appendix.

5. Conclusion

We propose GraphsGPT, the first framework with a pure transformer that converts Non-Euclidean graphs into Euclidean representations while preserving information, using an edge-centric GPT-style pretraining task. We show that Graph2Seq and GraphGPT serve as strong graph learners for representation and generation, respectively. The Euclidean representations offer new opportunities for tasks previously known to be challenging. GraphsGPT may create a new paradigm of graph modeling.

6. Rebuttal Details

Q1: Missing discussion on diffusion-based molecular generative models.

R1: We conduct additional experiments following (Kong et al., 2023) to compare GraphGPT-1W with diffusion-based methods on ZINC-250K. We follow the same few-shot generation setting described in Section 4.5, where we set M = 10K for a fair comparison. As shown in Table 5, GraphGPT-1W surpasses these methods by a large margin on various metrics, which further validates the strong generation ability of GraphGPT.

Table 5: Comparison with diffusion-based methods on ZINC-250K. We use M = 10K shots and sample the same number of Graph Word embeddings under different variance s.

| Model | Valid ↑ | Unique ↑ | Novel ↑ | NSPDK ↓ | FCD ↓ |
| GraphAF (Shi et al., 2020) | 68.47 | 98.64 | 100 | 0.044 | 16.02 |
| GraphDF (Luo et al., 2021) | 90.61 | 99.63 | 100 | 0.177 | 33.55 |
| MoFlow (Zang & Wang, 2020) | 63.11 | 99.99 | 100 | 0.046 | 20.93 |
| EDP-GNN (Niu et al., 2020) | 82.97 | 99.79 | 100 | 0.049 | 16.74 |
| GraphEBM (Liu et al., 2021a) | 5.29 | 98.79 | 100 | 0.212 | 35.47 |
| SPECTRE (Martinkus et al., 2022) | 90.20 | 67.05 | 100 | 0.109 | 18.44 |
| GDSS (Jo et al., 2022) | 97.01 | 99.64 | 100 | 0.019 | 14.66 |
| DiGress (Vignac et al., 2022) | 91.02 | 81.23 | 100 | 0.082 | 23.06 |
| GRAPHARM (Kong et al., 2023) | 88.23 | 99.46 | 100 | 0.055 | 16.26 |
| GraphGPT-1W (s=0.25) | 99.67 | 99.95 | 93.0 | 0.0002 | 1.78 |
| GraphGPT-1W (s=0.5) | 99.57 | 99.97 | 93.6 | 0.0003 | 1.79 |
| GraphGPT-1W (s=1.0) | 98.44 | 100 | 98.0 | 0.0012 | 2.89 |
| GraphGPT-1W (s=2.0) | 97.64 | 100 | 100 | 0.0056 | 8.47 |

Q2: How does the method consider the symmetry of graphs? Graph data is invariant to permutation.

R2: In Section 3.2, we mention that "we introduce a random shuffle of the position Codebook". We should explicitly state that this random shuffle of position vectors is equivalent to randomly shuffling the input order of atoms. This allows the model to learn from the data with random order augmentation. We point out that building a permutation-invariant encoder is easy and necessary; however, developing a decoder with permutation invariance poses a significant challenge for auto-regressive generation models. We randomly shuffle the position vectors, allowing the model to learn representations with different orders for molecules. To further verify the effectiveness of our method in handling permutation invariance, we conduct an additional experiment. Given an input molecular graph sequence, we randomly permute its order 1024 times and encode the shuffled sequences with Graph2Seq, obtaining a set of 1024 Graph Words. We then decode them back to graph sequences and observe the consistency, defined as the maximum percentage of the decoded sequences that share the same result. Table 6 shows the results on 1000 molecules from the test set, where we find both models are resistant to a considerable degree of permutation, i.e., 96.1% and 97.9% average consistency for GraphsGPT-1W and GraphsGPT-8W, respectively.

Table 6: Self-consistency of decoded sequences. "C@N" denotes that the decoded results of N out of the total 1024 permutations for each molecule are consistent. "Avg." denotes the average consistency over all test data.

| Models | C@256 | C@512 | C@768 | C@1024 | Avg. |
| GraphsGPT-1W | 100% | 99.2% | 94.1% | 77.3% | 96.1% |
| GraphsGPT-8W | 100% | 99.4% | 96.5% | 85.3% | 97.9% |

In addition, there is a contradiction between permutation-invariant models and auto-regressive models. Previous work (TokenGT (Kim et al., 2022)) focuses on representation learning and therefore does not suffer from the permutation-invariance issue. We combine representation with generation tasks in the same model and propose the technique of randomly shuffling position vectors so that all tasks work well. We note that randomly shuffling the position vector Codebook is more effective than shuffling the atom order itself. Readers can refer to the OpenReview rebuttal.

Acknowledgements

This work was supported by the National Science and Technology Major Project (No. 2022ZD0115101), the National Natural Science Foundation of China Project (No. U21A20427), Project (No. WU2022A009) from the Center of Synthetic Biology and Integrated Bioengineering of Westlake University, and Project (No. WU2023C019) from the Westlake University Industries of the Future Research Funding.


Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. GraphsGPT provides a new paradigm for graph representation, generation and manipulation. The Non-Euclidean to Euclidean transformation may affect broader downstream graph applications, such as graph translation and optimization. The methodology could be extended to other modalities, such as images and sequences.

References

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., and Smola, A. J. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International Conference on World Wide Web, pp. 37–48, 2013.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

Bagal, V., Aggarwal, R., Vinod, P., and Priyakumar, U. D. MolGPT: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling, 62(9):2064–2076, 2021.

Brown, N., Fiscato, M., Segler, M. H., and Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. Journal of Chemical Information and Modeling, 59(3):1096–1108, 2019.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Chanpuriya, S. and Musco, C. InfiniteWalk: Deep network embeddings as Laplacian embeddings with a nonlinearity. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1325–1333, 2020.

Chen, D., O'Bray, L., and Borgwardt, K. Structure-aware transformer for graph representation learning. In International Conference on Machine Learning, pp. 3469–3489. PMLR, 2022.

Chen, J., Ma, T., and Xiao, C. FastGCN: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.

Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S., and Hsieh, C.-J. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 257–266, 2019.

Dehghani, M., Djolonga, J., Mustafa, B., Padlewski, P., Heek, J., Gilmer, J., Steiner, A. P., et al. Scaling vision transformers to 22 billion parameters. In ICML, pp. 7480–7512. PMLR, 2023.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Diehl, F. Edge contraction pooling for graph neural networks. arXiv preprint arXiv:1905.10990, 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.

Gao, Z., Tan, C., and Li, S. Z. PiFold: Toward effective and efficient protein inverse folding. In The Eleventh International Conference on Learning Representations, 2022a.

Gao, Z., Tan, C., Wu, L., and Li, S. Z. SimVP: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3170–3180, 2022b.

Gao, Z., Tan, C., Chen, X., Zhang, Y., Xia, J., Li, S., and Li, S. Z. KW-Design: Pushing the limit of protein design via knowledge refinement. In The Twelfth International Conference on Learning Representations, 2023.

Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864, 2016.

Guo, H. and Mao, Y. Interpolating graph pair to regularize graph classification. In AAAI, volume 37, pp. 7766–7774, 2023.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.


Han, X., Jiang, Z., Liu, N., and Hu, X. G-mixup: Graph data augmentation for graph classification. In ICML, pp. 8230–8248. PMLR, 2022.

Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., and Tang, J. GraphMAE: Self-supervised masked graph autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 594–604, 2022.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., and Leskovec, J. Strategies for pre-training graph neural networks. arXiv preprint arXiv:1905.12265, 2019.

Hu, Z., Dong, Y., Wang, K., Chang, K.-W., and Sun, Y. GPT-GNN: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857–1867, 2020a.

Hu, Z., Dong, Y., Wang, K., Chang, K.-W., and Sun, Y. GPT-GNN: Generative pre-training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1857–1867, 2020b.

Hu, Z., Dong, Y., Wang, K., and Sun, Y. Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pp. 2704–2710, 2020c.

Huang, Y., Peng, X., Ma, J., and Zhang, M. 3DLinker: an E(3) equivariant variational autoencoder for molecular linker design. arXiv preprint arXiv:2205.07309, 2022.

Hussain, M. S., Zaki, M. J., and Subramanian, D. Edge-augmented graph transformers: Global self-attention is enough for graphs. arXiv preprint arXiv:2108.03348, 2021.

Hwang, D., Park, J., Kwon, S., Kim, K., Ha, J.-W., and Kim, H. J. Self-supervised auxiliary learning with meta-paths for heterogeneous graphs. Advances in Neural Information Processing Systems, 33:10294–10305, 2020.

Irwin, J. J. and Shoichet, B. K. ZINC: a free database of commercially available compounds for virtual screening. Journal of Chemical Information and Modeling, 45(1):177–182, 2005.

Jin, W., Derr, T., Liu, H., Wang, Y., Wang, S., Liu, Z., and Tang, J. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141, 2020.

Jo, J., Lee, S., and Hwang, S. J. Score-based generative modeling of graphs via the system of stochastic differential equations. In International Conference on Machine Learning, pp. 10362–10383. PMLR, 2022.

Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. Advances in Neural Information Processing Systems, 35:14582–14595, 2022.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.

Kipf, T. N. and Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016b.

Kong, L., Cui, J., Sun, H., Zhuang, Y., Prakash, B. A., and Zhang, C. Autoregressive diffusion model for graph generation. In International Conference on Machine Learning, pp. 17391–17408. PMLR, 2023.

Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.

Lee, J., Lee, I., and Kang, J. Self-attention graph pooling. In International Conference on Machine Learning, pp. 3734–3743. PMLR, 2019.

Li, X., Sun, L., Ling, M., and Peng, Y. A survey of graph neural network based recommendation in social networks. Neurocomputing, pp. 126441, 2023a.

Li, Z., Gao, Z., Tan, C., Li, S. Z., and Yang, L. T. General point model with autoencoding and autoregressive. arXiv preprint arXiv:2310.16861, 2023b.

Lin, H., Gao, Z., Xu, Y., Wu, L., Li, L., and Li, S. Z. Conditional local convolution for spatio-temporal meteorological forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7470–7478, 2022a.

Lin, Z., Tian, C., Hou, Y., and Zhao, W. X. Improving graph collaborative filtering with neighborhood-enriched contrastive learning. In Proceedings of the ACM Web Conference 2022, pp. 2320–2329, 2022b.

Liu, C., Li, Y., Lin, H., and Zhang, C. GNNRec: Gated graph neural network for session-based social recommendation model. Journal of Intelligent Information Systems, 60(1):137–156, 2023a.

Liu, M., Yan, K., Oztekin, B., and Ji, S. GraphEBM: Molecular graph generation with energy-based models. arXiv preprint arXiv:2102.00546, 2021a.

Liu, S., Wang, H., Liu, W., Lasenby, J., Guo, H., and Tang, J. Pre-training molecular graph representation with 3D geometry. arXiv preprint arXiv:2110.07728, 2021b.


Liu, X., Zhang, F., Hou, Z., Mian, L., Wang, Z., Zhang, J., and Tang, J. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–876, 2021c.

Liu, Y., Jin, M., Pan, S., Zhou, C., Zheng, Y., Xia, F., and Philip, S. Y. Graph self-supervised learning: A survey. IEEE Transactions on Knowledge and Data Engineering, 35(6):5879–5900, 2022.

Liu, Y., Yang, X., Zhou, S., Liu, X., Wang, S., Liang, K., Tu, W., and Li, L. Simple contrastive graph clustering. IEEE Transactions on Neural Networks and Learning Systems, 2023b.

Liu, Y., Yang, X., Zhou, S., Liu, X., Wang, Z., Liang, K., Tu, W., Li, L., Duan, J., and Chen, C. Hard sample aware network for contrastive deep graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 8914–8922, 2023c.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pp. 10012–10022, 2021d.

Luo, Y., Yan, K., and Ji, S. GraphDF: A discrete flow model for molecular graph generation. In International Conference on Machine Learning, pp. 7192–7203. PMLR, 2021.

Ma, Y., Wang, S., Aggarwal, C. C., and Tang, J. Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731, 2019.

Martinkus, K., Loukas, A., Perraudin, N., and Wattenhofer, R. SPECTRE: Spectral conditioning helps to overcome the expressivity limits of one-shot graph generators. In International Conference on Machine Learning, pp. 15159–15179. PMLR, 2022.

McInnes, L. and Healy, J. Accelerated hierarchical density based clustering. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on, pp. 33–42. IEEE, 2017.

McInnes, L., Healy, J., and Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

Mialon, G., Chen, D., Selosse, M., and Mairal, J. GraphiT: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021.

Min, E., Chen, R., Bian, Y., Xu, T., Zhao, K., Huang, W., Zhao, P., Huang, J., Ananiadou, S., and Rong, Y. Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455, 2022.

Niu, C., Song, Y., Song, J., Zhao, S., Grover, A., and Ermon, S. Permutation invariant graph generation via score-based generative modeling. In International Conference on Artificial Intelligence and Statistics, pp. 4474–4484. PMLR, 2020.

Pang, Y., Wang, W., Tay, F. E., Liu, W., Tian, Y., and Yuan, L. Masked autoencoders for point cloud self-supervised learning. In ECCV, pp. 604–621. Springer, 2022.

Park, J., Shim, H., and Yang, E. Graph transplant: Node saliency-guided graph mixup with local structure preservation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7966–7974, 2022.

Peng, X., Luo, S., Guan, J., Xie, Q., Peng, J., and Ma, J. Pocket2Mol: Efficient molecular sampling based on 3D protein pockets. In International Conference on Machine Learning, pp. 17644–17655. PMLR, 2022.

Peng, Z., Dong, Y., Luo, M., Wu, X.-M., and Zheng, Q. Self-supervised graph representation learning via global context prediction. arXiv preprint arXiv:2003.01604, 2020a.

Peng, Z., Huang, W., Luo, M., Zheng, Q., Rong, Y., Xu, T., and Huang, J. Graph representation learning via graphical mutual information maximization. In Proceedings of The Web Conference 2020, pp. 259–270, 2020b.

Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710, 2014.

Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Frontiers in Pharmacology, 11:565644, 2020.

Qiu, J., Chen, Q., Dong, Y., Zhang, J., Yang, H., Ding, M., Wang, K., and Tang, J. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1150–1160, 2020.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.

Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., and Huang, J. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems, 33:12559–12571, 2020.

Shakibajahromi, B., Kim, E., and Breen, D. E. Rimeshgnn: A rotation-invariant graph neural network for mesh classification. In WACV, pp. 3150–3160, 2024.

Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2019.

Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.

Stärk, H., Beaini, D., Corso, G., Tossou, P., Dallago, C., Günnemann, S., and Liò, P. 3d infomax improves gnns for molecular property prediction. In ICML, pp. 20479–20502. PMLR, 2022.

Sun, F.-Y., Hoffmann, J., Verma, V., and Tang, J. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv preprint arXiv:1908.01000, 2019.

Suresh, S., Li, P., Hao, C., and Neville, J. Adversarial graph augmentation to improve graph contrastive learning. Advances in Neural Information Processing Systems, 34:15920–15933, 2021.

Tan, C., Gao, Z., and Li, S. Z. Target-aware molecular graph generation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 410–427. Springer, 2023.

Tian, Y., Dong, K., Zhang, C., Zhang, C., and Chawla, N. V. Heterogeneous graph masked autoencoders. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 9997–10005, 2023.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P. Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.

Wang, P., Agarwal, K., Ham, C., Choudhury, S., and Reddy, C. K. Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks. In Proceedings of the Web Conference 2021, pp. 2946–2957, 2021.

Wang, Y., Wang, J., Cao, Z., and Barati Farimani, A. Molecular contrastive learning of representations via graph neural networks. NMI, 4(3):279–287, 2022.

Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In ICML, pp. 6861–6871. PMLR, 2019.

Wu, L., Lin, H., Tan, C., Gao, Z., and Li, S. Z. Self-supervised learning on graphs: Contrastive, generative, or predictive. IEEE Transactions on Knowledge and Data Engineering, 2021a.

Wu, L., Xia, J., Gao, Z., et al. Graphmixup: Improving class-imbalanced node classification by reinforcement mixup and self-supervised context prediction. In ECML-PKDD, pp. 519–535. Springer, 2022.

Wu, L., Huang, Y., Tan, C., Gao, Z., Hu, B., Lin, H., Liu, Z., and Li, S. Z. Psc-cpi: Multi-scale protein sequence-structure contrasting for efficient and generalizable compound-protein interaction prediction. arXiv preprint arXiv:2402.08198, 2024a.

Wu, L., Tian, Y., Huang, Y., Li, S., Lin, H., Chawla, N. V., and Li, S. Z. Mape-ppi: Towards effective and efficient protein-protein interaction prediction via microenvironment-aware protein embedding. arXiv preprint arXiv:2402.14391, 2024b.

Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. Moleculenet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

Wu, Z., Jain, P., Wright, M., Mirhoseini, A., Gonzalez, J. E., and Stoica, I. Representing long-range context for graph neural networks with global attention. NeurIPS, 34:13266–13279, 2021b.

Xia, J., Wu, L., Chen, J., Hu, B., and Li, S. Z. Simgrace: A simple framework for graph contrastive learning without data augmentation. In Proceedings of the ACM Web Conference 2022, pp. 1070–1079, 2022a.

Xia, J., Zhao, C., Hu, B., Gao, Z., Tan, C., Liu, Y., Li, S., and Li, S. Z. Mole-bert: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2022b.

Xia, J., Zhao, C., Hu, B., Gao, Z., Tan, C., Liu, Y., Li, S., and Li, S. Z. Mole-bert: Rethinking pre-training graph neural networks for molecules. In The Eleventh International Conference on Learning Representations, 2023.

Xiao, W., Zhao, H., Zheng, V. W., and Song, Y. Vertex-reinforced random walk for network embedding. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 595–603. SIAM, 2020.

Xie, Y., Xu, Z., Zhang, J., Wang, Z., and Ji, S. Self-supervised learning of graph neural networks: A unified review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2412–2429, 2022.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

Xu, M., Wang, H., Ni, B., Guo, H., and Tang, J. Self-supervised graph-level representation learning with local and global structure. In International Conference on Machine Learning, pp. 11548–11558. PMLR, 2021.

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems, 34:28877–28888, 2021.

Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. Advances in Neural Information Processing Systems, 31, 2018.

You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., and Shen, Y. Graph contrastive learning with augmentations. NeurIPS, 33:5812–5823, 2020.

You, Y., Chen, T., Shen, Y., and Wang, Z. Graph contrastive learning automated. In International Conference on Machine Learning, pp. 12121–12132. PMLR, 2021.

Yu, X., Tang, L., Rao, Y., et al. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In CVPR, pp. 19313–19322, 2022.

Zang, C. and Wang, F. Moflow: an invertible flow model for generating molecular graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 617–626, 2020.

Zeng, J. and Xie, P. Contrastive self-supervised learning for graph classification. In AAAI, volume 35, pp. 10824–10832, 2021.

Zhang, B. and Sennrich, R. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

Zhang, J., Luo, D., and Wei, H. Mixupexplainer: Generalizing explanations for graph neural networks with data augmentation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3286–3296, 2023.

Zhang, Z., Liu, Q., Wang, H., Lu, C., and Lee, C.-K. Motif-based graph self-supervised learning for molecular property prediction. Advances in Neural Information Processing Systems, 34:15870–15882, 2021.

Zhao, J., Li, C., Wen, Q., Wang, Y., Liu, Y., Sun, H., Xie, X., and Ye, Y. Gophormer: Ego-graph transformer for node classification. arXiv preprint arXiv:2110.13094, 2021.

Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020a.

Zhou, J., Shen, J., and Xuan, Q. Data augmentation for graph classification. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2341–2344, 2020b.

Zhu, Y., Xu, Y., Yu, F., Liu, Q., Wu, S., and Wang, L. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.

Zhu, Y., Xu, Y., Yu, F., et al. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021, pp. 2069–2080, 2021.

Zou, D., Wei, W., Mao, X.-L., et al. Multi-level cross-view contrastive learning for knowledge-aware recommender system. In SIGIR, pp. 1358–1368, 2022.

A. Representation
When applying graph mixup, each training sample is drawn from the original data with probability pself and from the mixed data with probability (1 − pself). The mixup hyperparameters α and pself are listed in Table 7; a minimal sketch of this sampling procedure is given after the table.

                 Tox21   ToxCast  Sider   HIV     BBBP         BACE    ESOL    FreeSolv  LIPO
batch size       16      16       16      64      128          16      16      64        16
lr               1e-5    5e-5     1e-4    1e-4    5e-4         1e-5    1e-4    1e-4      5e-5
dropout          0.0     0.0      0.0     0.0     0.1 or 0.3   0.0     0.1     0.1       0.0
epoch            50      50       50      50      50 or 100    50      50      50        50
α for mixup      0.5     0.1      0.5     0.5     0.5          0.5     0.5     0.5       0.1
pself for mixup  0.7     0.7      0.7     0.5     0.5          0.7     0.7     0.9       0.7

Table 7: Hyperparameters for property prediction.
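The sketch below illustrates this sampling scheme. It is a minimal illustration only: it assumes mixup is applied to the encoded Euclidean Graph Word representations with a Beta(α, α) coefficient, as in standard mixup, and the helper name encode_graph_words is a hypothetical stand-in rather than part of the released code.

import random

import torch


def sample_training_example(graphs, labels, encode_graph_words,
                            p_self: float = 0.7, alpha: float = 0.5):
    """Draw one training example: the original sample with probability p_self,
    otherwise a mixup of two samples in the Euclidean Graph Word space.

    `encode_graph_words` is a hypothetical callable mapping a graph to a
    (k, d) tensor of Graph Words; labels are assumed to be tensors.
    """
    i = random.randrange(len(graphs))
    w_i, y_i = encode_graph_words(graphs[i]), labels[i]

    if random.random() < p_self:           # keep the original sample
        return w_i, y_i

    j = random.randrange(len(graphs))      # pick a partner for mixup
    w_j, y_j = encode_graph_words(graphs[j]), labels[j]

    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    w_mix = lam * w_i + (1.0 - lam) * w_j  # interpolate Graph Words
    y_mix = lam * y_i + (1.0 - lam) * y_j  # interpolate (soft) targets
    return w_mix, y_mix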

B. Generation
B.1. Few-Shot Generation
We introduce the metrics (Bagal et al., 2021) used for few-shot generation as follows:

• Validity: the fraction of generated molecules that are valid. We use RDKit to check the validity of molecules. Validity
measures how well the model has learned the SMILES grammar and the valency of atoms.
• Uniqueness: the fraction of valid generated molecules that are unique. Low uniqueness highlights repetitive molecule
generation and a low level of distribution learning by the model.
• Novelty: the fraction of valid unique generated molecules that are not in the training set. Low novelty is a sign of
overfitting. We do not want the model to memorize the training data.
• Internal Diversity (IntDivp): measures the diversity of the generated molecules; this metric is specially designed to detect mode collapse, i.e., whether the model keeps generating similar structures. It uses the power (p) mean of the Tanimoto similarity (T) between the fingerprints of all pairs of molecules (s1, s2) in the generated set S, as defined below; a short computation sketch follows the equation.

\mathrm{IntDiv}_p(S) = 1 - \sqrt[p]{\frac{1}{|S|^2} \sum_{s_1, s_2 \in S} T(s_1, s_2)^p}    (14)
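The following is a minimal sketch of how IntDivp could be computed with RDKit Morgan fingerprints; the fingerprint radius and bit size are illustrative choices rather than values specified in the paper.

from itertools import product

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def internal_diversity(smiles_list, p: int = 1,
                       radius: int = 2, n_bits: int = 2048) -> float:
    """IntDiv_p over a set of generated molecules given as SMILES.

    Invalid SMILES are skipped; fingerprints are Morgan bit vectors and
    T is the pairwise Tanimoto similarity, following Eq. (14).
    """
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
           for m in mols if m is not None]
    n = len(fps)
    if n == 0:
        return 0.0

    total = 0.0
    for fp1, fp2 in product(fps, repeat=2):   # all ordered pairs (s1, s2)
        total += DataStructs.TanimotoSimilarity(fp1, fp2) ** p

    return 1.0 - (total / (n * n)) ** (1.0 / p)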

B.2. Conditional Generation


We provide a detailed description of the conditions used for conditional generation as follows:

• QED (Quantitative Estimate of Drug-likeness): a measure that quantifies the “drug-likeness” of a molecule based on
its pharmacokinetic profile, ranging from 0 to 1.
• SA (Synthetic Accessibility): a score that predicts the difficulty of synthesizing a molecule based on multiple factors.
Lower SA scores indicate easier synthesis.
• logP (Partition Coefficient): a key parameter in studies of drug absorption and distribution in the body, measuring
a molecule's hydrophobicity.
• Scaffold: the core structure of a molecule, which typically includes rings and the atoms that connect them. It provides
a framework upon which different functional groups can be added to create new molecules.

To integrate conditional information into our model, we set aside an additional 100M molecules from the ZINC database for finetuning, which we denote as the dataset DG. For each molecule G ∈ DG, we compute its property values vQED, vSA and vlogP and normalize them to zero mean and unit variance, yielding v̄QED, v̄SA and v̄logP.

The Graph2Seq model takes all properties and scaffolds as inputs and transforms them into the Graph Word sequence W = [w1, w2, · · · , wk]. The additional property and scaffold information enables Graph2Seq to encode Graph Words with conditions. The Graph Words are subsequently decoded by GraphGPT following the same implementation as in Section 3.3. In summary, the inputs of the Graph2Seq encoder comprise:

1. Graph Word Prompts [[GW_1], · · · , [GW_k]], which are identical to the word prompts discussed in Section 3.2.
2. Property Token Sequence [[QED], [SA], [logP]], which is encoded from the normalized property values v̄QED, v̄SA and v̄logP.
3. Scaffold Flexible Token Sequence FTSeqScaf, representing the flexible token sequence of the molecule's scaffold.
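As a rough illustration of how these three parts could be concatenated into a single encoder input, the sketch below assembles the sequence from learned word prompts, projected property values, and embedded scaffold tokens. All module and dimension choices here are hypothetical placeholders, not the actual GraphsGPT implementation.

import torch
import torch.nn as nn


class ConditionalInputBuilder(nn.Module):
    """Illustrative sketch: concatenate Graph Word prompts, property tokens,
    and scaffold tokens into one Graph2Seq encoder input sequence."""

    def __init__(self, k: int = 8, d_model: int = 512, scaffold_vocab: int = 512):
        super().__init__()
        self.word_prompts = nn.Parameter(torch.randn(k, d_model))    # [GW_1..GW_k]
        self.property_proj = nn.Linear(1, d_model)                   # value -> property token
        self.scaffold_embed = nn.Embedding(scaffold_vocab, d_model)  # FTSeqScaf tokens

    def forward(self, v_qed, v_sa, v_logp, scaffold_ids):
        # v_qed, v_sa, v_logp: scalar tensors holding the normalized property values;
        # scaffold_ids: (L,) long tensor of scaffold token ids.
        props = torch.stack([v_qed, v_sa, v_logp]).view(3, 1)
        property_tokens = self.property_proj(props)                  # (3, d_model)
        scaffold_tokens = self.scaffold_embed(scaffold_ids)          # (L, d_model)
        # Encoder input: [GW prompts] + [QED, SA, logP] + FTSeqScaf
        return torch.cat([self.word_prompts, property_tokens, scaffold_tokens], dim=0)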

For the sake of comparison, we followed Bagal et al. (2021) and trained a MolGPT model on the GuacaMol dataset (Brown
et al., 2019) using QED, SA, logP, and scaffolds as conditions for 10 epochs. We compare the conditional generation ability
by measuring the MAD (Mean Absolute Deviation), SD (Standard Deviation), validity and uniqueness. Table 8 presents the
full results, underscoring the superior control of GraphGPT-1W-C over molecular properties.
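As a reference for how the deviation metrics could be computed on generated molecules, the sketch below uses RDKit's QED and Crippen logP calculators; the SA score lives in RDKit's contrib sascorer and is omitted here, and the function name is illustrative only.

import numpy as np
from rdkit import Chem
from rdkit.Chem import QED, Crippen


def mad_and_sd(smiles_list, oracle_value: float, prop: str = "QED"):
    """MAD from the target value and SD of a property over valid molecules."""
    calc = {"QED": QED.qed, "logP": Crippen.MolLogP}[prop]
    values = [calc(m) for m in (Chem.MolFromSmiles(s) for s in smiles_list)
              if m is not None]
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return float("nan"), float("nan")
    mad = np.mean(np.abs(values - oracle_value))  # Mean Absolute Deviation from the target
    sd = np.std(values)                           # Standard Deviation of the generated property
    return mad, sd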

Model           Pretrain  Metric      QED=0.5  QED=0.7  QED=0.9  SA=0.7  SA=0.8  SA=0.9  logP=0.0  logP=2.0  logP=4.0  Avg.
MolGPT          ✗         MAD ↓       0.081    0.082    0.097    0.024   0.019   0.013   0.304     0.239     0.286     0.127
                          SD ↓        0.065    0.066    0.092    0.022   0.016   0.013   0.295     0.232     0.258     0.118
                          Validity ↑  0.985    0.985    0.984    0.975   0.988   0.995   0.982     0.983     0.982     0.984
GraphGPT-1W-C   ✗         MAD ↓       0.041    0.031    0.077    0.012   0.028   0.031   0.103     0.189     0.201     0.079
                          SD ↓        0.079    0.077    0.121    0.055   0.062   0.070   0.460     0.656     0.485     0.229
                          Validity ↑  0.988    0.995    0.991    0.995   0.991   0.998   0.980     0.992     0.991     0.991
GraphGPT-1W-C   ✓         MAD ↓       0.032    0.033    0.051    0.002   0.009   0.022   0.017     0.190     0.268     0.069
                          SD ↓        0.080    0.075    0.090    0.042   0.037   0.062   0.463     0.701     0.796     0.261
                          Validity ↑  0.996    0.998    0.999    0.995   0.999   0.996   0.994     0.990     0.992     0.995

Table 8: Overall comparison between GraphGPT-1W-C and MolGPT on different properties with scaffold SMILES
“c1ccccc1”. “MAD” denotes the Mean Absolute Deviation of the property value in generated molecules compared to the
oracle value. “SD” denotes the Standard Deviation of the generated property.

Figure 6: Property distribution of generated molecules under different conditions using GraphGPT-1W-C. Panels: (a) QED with target values 0.5/0.7/0.9, (b) SA with target values 0.7/0.8/0.9, (c) logP with target values 0.0/2.0/4.0; each panel shows the density of the property among generated molecules against the dataset distribution.

C. Graph Words
C.1. Clustering
The efficacy of the Graph2Seq encoder hinges on its ability to effectively map Non-Euclidean graphs into Euclidean latent
features in a structured manner. To investigate this, we visualize the latent Graph Words space using sampled features,
encoding 32,768 molecules with Graph2Seq-1W and employing HDBSCAN (McInnes & Healy, 2017) for clustering the
Graph Words.
Figures 7 and 8 respectively illustrate the clustering results and the molecules within each cluster. An intriguing observation
emerges from these results: the Graph2Seq model exhibits a propensity to cluster molecules with similar properties (e.g.,
identical functional groups in clusters 0, 1, 4, 5; similar structures in clusters 2, 3, 7; or similar Halogen atoms in cluster 3)
within the latent Graph Words space. This insight could potentially inform and inspire future research.
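A minimal sketch of this clustering pipeline is shown below, assuming the umap-learn and hdbscan packages and a precomputed matrix of Graph Word features; the file name and hyperparameter values are illustrative, not the settings used in the paper.

import numpy as np
import hdbscan   # pip install hdbscan
import umap      # pip install umap-learn

# graph_words: (N, k * d) array, one flattened Graph Word vector per molecule,
# e.g. N = 32768 molecules encoded with Graph2Seq-1W (k = 1).
graph_words = np.load("graph_words.npy")  # hypothetical file name

# Cluster directly in the Euclidean Graph Word space.
clusterer = hdbscan.HDBSCAN(min_cluster_size=100)  # illustrative setting
labels = clusterer.fit_predict(graph_words)        # -1 marks noise points

# Project to 2D with UMAP for visualization (as in Figure 7).
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(graph_words)

for cluster_id in sorted(set(labels) - {-1}):
    members = np.where(labels == cluster_id)[0]
    print(f"cluster {cluster_id}: {len(members)} molecules")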

Figure 7: UMAP (McInnes et al., 2018) visualization of the clustering result on the Graph Words of Graph2Seq-1W.

Figure 8: Visualization of the molecules in each cluster. Panels (a)–(h) correspond to Clusters 0–7.

C.2. Graph Translation


Graph Interpolation. Exploiting the Euclidean representation of graphs, we explore the continuity of the latent Graph Words via interpolation. Consider a source molecule Gs and a target molecule Gt. We use Graph2Seq to encode them into Graph Words Ws and Wt, respectively, and then linearly interpolate between these two Graph Words, obtaining a series of interpolated Graph Words W′α1, W′α2, . . . , W′αk, where each interpolated Graph Word is computed as W′αi = (1 − αi)Ws + αiWt. These interpolated Graph Words are subsequently decoded back into molecules using GraphGPT.
The interpolation results are depicted in Figure 9. We observe a smooth transition from the source to the target molecule,
which demonstrates the model’s ability to capture and traverse the continuous latent space of molecular structures effectively.
This capability could potentially be exploited for tasks such as molecular optimization and drug discovery.
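A minimal sketch of the interpolation procedure is given below; graph2seq.encode and graphgpt.decode are hypothetical stand-ins for the pretrained encoder and decoder interfaces, not the actual released API.

import torch


def interpolate_graphs(graph2seq, graphgpt, source_graph, target_graph, alphas):
    """Linearly interpolate between two molecules in the Graph Word space.

    `graph2seq` and `graphgpt` are assumed to expose `encode(graph) -> (k, d)`
    and `decode(graph_words) -> graph`; both names are placeholders.
    """
    w_s = graph2seq.encode(source_graph)  # Graph Words of the source, shape (k, d)
    w_t = graph2seq.encode(target_graph)  # Graph Words of the target, shape (k, d)

    molecules = []
    for alpha in alphas:                  # e.g. torch.linspace(0, 1, 7)
        w_alpha = (1.0 - alpha) * w_s + alpha * w_t
        molecules.append(graphgpt.decode(w_alpha))
    return molecules


# Example: seven evenly spaced interpolation points from source to target.
# mols = interpolate_graphs(graph2seq, graphgpt, g_src, g_tgt, torch.linspace(0, 1, 7))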

Figure 9: Graph interpolation results with different source and target molecules using GraphsGPT-1W. Rows (a)–(d) each show the decoded molecules from the source (α = 0) to the target (α = 1) at increasing intermediate values of α.

Graph Hybridization. With Graph2Seq, a graph G can be transformed into a fixed-length Graph Word sequence W = [w1, · · · , wk], where each Graph Word is expected to encapsulate distinct semantic information. We investigate the representation of Graph Words by hybridizing them among different inputs.

Specifically, consider a source molecule Gs and a target molecule Gt, along with their Graph Words Ws = [w_s^1, · · · , w_s^k] and Wt = [w_t^1, · · · , w_t^k]. Given an index set I, we replace a subset of the source Graph Words with the corresponding target Graph Words, w_s^i ← w_t^i for i ∈ I, yielding the hybrid Graph Words Wh = [w_h^1, · · · , w_h^k], where:

w_h^i = \begin{cases} w_t^i, & i \in I \\ w_s^i, & i \notin I \end{cases}    (15)
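A minimal sketch of this hybridization step is shown below; as with the interpolation example, the encoder/decoder handles and index conventions are hypothetical placeholders.

import torch


def hybridize_graph_words(w_source: torch.Tensor, w_target: torch.Tensor,
                          swap_indices) -> torch.Tensor:
    """Build hybrid Graph Words per Eq. (15): copy the source words, then
    overwrite the words at `swap_indices` with the target's words.

    Both inputs have shape (k, d); indices here are 0-based, while the
    figure labels (e.g. Hybrid-4-5-7) refer to Graph Word positions.
    """
    w_hybrid = w_source.clone()
    idx = torch.as_tensor(list(swap_indices), dtype=torch.long)
    w_hybrid[idx] = w_target[idx]
    return w_hybrid


# Example with GraphsGPT-8W (k = 8): hybridize Graph Words 4, 5 and 7.
# w_h = hybridize_graph_words(w_s, w_t, swap_indices=[4, 5, 7])
# hybrid_graph = graphgpt.decode(w_h)   # `graphgpt` is a hypothetical handle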

We then decode Wh with GraphGPT back into a graph and observe the changes in the resulting molecules. The results are depicted in Figure 10. We observe that hybridizing specific Graph Words can introduce certain features of the target molecule into the source molecule, such as the sulfhydryl functional group. This suggests that Graph Words could potentially serve as a tool for manipulating specific features of molecular structures, with implications for molecular design and optimization tasks.

Figure 10: Hybridization results of Graph Words using GraphsGPT-8W, which has 8 Graph Words in total. Each row shows the source molecule followed by the decoded results after hybridizing Graph Words 4, then 4–5, then 4–5–7 from the target molecule (Source → Hybrid-4 → Hybrid-4-5 → Hybrid-4-5-7 → Target).
