
FlowerFormer: Empowering Neural Architecture Encoding

using a Flow-aware Graph Transformer

Dongyeong Hwang Hyunju Kim Sunwoo Kim Kijung Shin


Kim Jaechul Graduate School of AI, KAIST, Seoul, Republic of Korea
{dy.hwang, hyunju.kim, kswoo97, kijungs}@kaist.ac.kr
arXiv:2403.12821v2 [cs.LG] 21 Mar 2024

Abstract

The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution. Thus, considerable efforts have been made to quickly and accurately estimate the performances of neural architectures, without full training or evaluation, for given tasks and datasets. Neural architecture encoding has played a crucial role in the estimation, and graph-based methods, which treat an architecture as a graph, have shown prominent performance. For enhanced representation learning of neural architectures, we introduce FLOWERFORMER, a powerful graph transformer that incorporates the information flows within a neural architecture. FLOWERFORMER consists of two key components: (a) bidirectional asynchronous message passing, inspired by the flows; (b) global attention built on flow-based masking. Our extensive experiments demonstrate the superiority of FLOWERFORMER over existing neural encoding methods, and its effectiveness extends beyond computer vision models to include graph neural networks and auto speech recognition models. Our code is available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.

1. Introduction

While deep learning models have demonstrated their efficacy across various applications, the performance of a specific neural architecture heavily depends on the specific downstream tasks and datasets employed. As a result, numerous neural architectures have been developed [14, 15].

In response to this dependency, significant efforts have been made to rapidly and accurately predict the performances of neural architectures for given tasks and datasets. This endeavor is crucial because exhaustively training and/or evaluating many candidate neural architectures is an expensive process. To this end, researchers have primarily employed machine learning techniques [6, 24].

Especially, various neural architecture encoding methods have been proposed, since obtaining an accurate representation of each architecture plays a crucial role in the estimation process. Their focus has mainly revolved around (a) transforming input neural architectures to appropriate data structures [25, 47] and (b) applying representation-learning models to the transformed structures [5, 48].

Some have treated neural architectures as graphs and applied graph representation learning. They, however, share some limitations. For instance, their basic message-passing mechanisms oversimplify neural-architecture characteristics [41, 46] and may suffer from over-smoothing [34], over-squashing [2], or limited expressiveness [37].

Graph Transformers (GTs), when incorporated with adequate information, are recognized for enhancing basic message passing, making them effective in various graph classification [16, 52] and regression [7, 27] tasks. One strength of GTs lies in their global attention mechanisms [44], where all nodes in an input graph contribute directly to forming the representation of each individual node.

However, without integrating relevant topological or external information of input graphs, the relevance of attention scores, and thus the effectiveness of GTs, might be impaired. For example, Niu et al. [33] showed the essentiality of using motif-based spatial embedding to incorporate the characteristics of molecule graphs into GTs.

In this work, we propose FLOWERFORMER (Flow-aware graph transformer), a GT model specialized in capturing information flows within neural architectures, as illustrated in Fig. 1. The information flows of a neural architecture contain the characteristics of both forward and backward propagations of the architecture, and thus describe its fundamental properties. FLOWERFORMER includes two core modules: the flow encode module and the flow-aware global attention module. The former conducts bidirectional asynchronous message passing, imitating the forward and backward propagations within the input neural architecture. The latter applies global attention with masking schemes based on the flow-based dependencies between nodes.

Our extensive experiments on neural architecture performance prediction, conducted using five benchmark datasets,
validate the superiority of FLOWERFORMER over state-of-the-art neural encoding models [32, 50]. The results highlight the effectiveness of incorporating flows into GTs. Our contributions are summarized as follows:

• We propose FLOWERFORMER, a flow-aware GT-based neural architecture encoding model. To our best knowledge, FLOWERFORMER is the first GT model specifically designed to capture flows.
• FLOWERFORMER outperforms six baseline architectures, including the most recent ones [32, 50], by a substantial margin across three benchmark datasets in the computer vision domain. Specifically, in predicting the performance of neural architectures, it outperforms the top-performing baseline method by a margin of up to 4.38% in Kendall's Tau. Additionally, through ablation studies, we justify the design choices made in FLOWERFORMER.
• Beyond computer vision neural architectures, FLOWERFORMER also excels at performance prediction for graph neural networks and auto speech recognition architectures. In the benchmarks for these architectures, FLOWERFORMER achieves performance gains of up to 4.41% in Kendall's Tau over baseline models.

Our code is available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.

Figure 1. Information flows within an example neural architecture from the NAS-Bench-101 benchmark [51]. The architecture is represented as a directed graph where each node corresponds to an operation, and the topological structure of the graph encodes the sequence in which these operations are performed. For instance, the '1×1' (convolution) operation is executed only after the '3×3' (convolution) and 'mp' (max pooling) operations have been completed. The forward pass, depicted by blue arrows, is followed by the backpropagation of the loss, depicted by orange arrows. The number displayed above each node indicates the processing order within each flow.

2. Related work

In this section, we briefly review related studies in neural architecture encoding and graph transformers (GTs).

2.1. Neural architecture encoding

Neural architecture encoding [21, 24, 25, 45, 47], which aims to learn representations of neural architectures, has gained considerable attention due to its significant downstream tasks, such as performance prediction (i.e., the prediction of task- and data-specific performance for given architectures without full training or evaluation).

One popular class of approaches is graph-based, modeling neural architectures as graphs and using graph neural networks [18] for representation learning. These approaches have also introduced topology-based graph similarity and operation-specific embeddings [4, 8].

Another significant approach aims to obtain representations that mimic the forward and/or backward passes within neural architectures. For instance, GATES [31] updates operation embeddings by mimicking the application of operations to information (which is also represented as a vector), thus effectively replicating the forward pass of convolution operations. Another method, TA-GATES [32], simulates an iterative process involving both forward and backward passes, with specialized handling for specific operations, e.g., skip-connections. However, these methods focus on flows only at a local level, by simulating a series of local operations, and may overlook a global-level perspective.

Transformer-based models [23, 49] are capable of capturing global-level perspectives through attention mechanisms. NAR-Former [50], a multi-stage fusion transformer, is one of the state-of-the-art methods for predicting neural architecture performance. They (1) represent a neural architecture as a sequence of operations to employ a transformer-based model and (2) leverage multiple valid sequences from the same architecture for augmentation.

In this work, we unify all three dimensions (graph learning, flow modeling, and global attention) by introducing a novel flow-aware GT, marking the first instance of such integration to the best of our knowledge.

2.2. Graph transformers (GTs)

Graph transformers (GTs) [11, 16, 19, 37, 42, 52] apply global (i.e., graph-level) attention between all node pairs. Recently, GTs have shown remarkable performance in various graph-level tasks, including molecular property prediction [17, 38], image classification [30, 55], and human interaction recognition [35].

To further improve their effectiveness, global attention is often supplemented with topological and/or external information. The information includes eigenvectors of adjacency and Laplacian matrices [19, 42] and pair-wise node similarity derived from shortest paths, diffusion kernels, random walks, etc. [19, 29, 52].

Some GTs are tailored for specific types of graphs. For molecular graphs, where motifs play key roles, Niu et al. [33] employ motif-based spatial embeddings in a GT. DAGFormer [26] is designed for directed acyclic graphs (DAGs) and incorporates depth-based positional encodings and reachability-based attention. Note that DAGFormer is designed for general DAGs, and it is not optimized for encoding neural architectures, especially in capturing architecture-specific flows.
Figure 2. Overview of the proposed FLOWERFORMER, which contains two key modules in each of its layers: the flow encode module and the flow-aware global attention module. The flow encode module performs bidirectional asynchronous message passing, inspired by forward and backward passes, to produce a node embedding matrix H_flow. The flow-aware global attention module computes attention with a flow-based masking scheme to yield another node embedding matrix H_global. These two embedding matrices, H_flow and H_global, are combined and then projected to produce updated node embeddings at each layer. This process is iterated over L layers, and the output node embeddings are aggregated to form the final architecture embedding, which is fed into a regressor for performance prediction.
3. Proposed method: FLOWERFORMER

In this section, we present FLOWERFORMER (Flow-aware graph transformer), a graph transformer model designed to capture information flows within an input neural architecture. First, we provide the motivation behind FLOWERFORMER in Sec. 3.1. Then, we describe how an input neural architecture is represented as a graph in Sec. 3.2. After that, we elaborate on how FLOWERFORMER learns the representation of the neural architecture graph. Specifically, we describe the two core modules of FLOWERFORMER, collectively referred to as FLOWER, in Sec. 3.3. Lastly, we present the overall framework (refer to Fig. 2) in Sec. 3.4.

3.1. Motivation of capturing information flows

Despite the remarkable success of Graph Transformers (GTs) in various graph-level tasks, including graph classification [16, 52] and regression [7, 27], their application to encoding neural architectures has received relatively limited attention. Existing applications of GTs suggest that additional design choices for accurately capturing the underlying characteristics of input graphs (on top of the global attention mechanism between all pairs of nodes) are essential for the effectiveness of GTs. Refer to Sec. 2.2 for some examples.

In this work, we focus on a crucial aspect: capturing information flows within neural architectures (i.e., input graphs). Information flows include both the forward pass of data and the backpropagation of gradients. Hence, it is essential to capture information flows to incorporate how neural architectures are trained and conduct inference into their embeddings (i.e., the encoded neural architectures).

3.2. Input modeling

We represent a given neural architecture as a directed acyclic graph (DAG), with each node representing an operation (e.g., pooling or convolution). Each directional edge between two nodes indicates the information flow between the corresponding operations, aligning with the direction of data propagation during the forward pass. An illustrative example can be found on the left-hand side of Figure 3.

Figure 3. An example neural architecture from the NAS-Bench-101 dataset, represented as a directed acyclic graph (DAG), and its adjacency matrix A. Each column of the node feature matrix X corresponds to a specific operation, and each row in X is a one-hot vector indicating the type of operation associated with the corresponding node.

We denote the graph representation of a neural architecture by G = (A, X), a tuple of an adjacency matrix A ∈ {0, 1}^(N×N) and a node (i.e., operation) feature matrix X ∈ {0, 1}^(N×D), where N is the number of nodes and D is the number of operations. The adjacency matrix encodes direct connections between node pairs in a graph. Its binary entries indicate whether a directional edge exists between each pair of nodes. Specifically, the (i, j)-th entry of A is set to 1 if there is a directed edge from the i-th node (denoted as v_i) to the j-th node (denoted as v_j), and 0 otherwise. Each node is associated with a one-hot feature vector representing its corresponding operation, and these vectors are stacked vertically to form the node feature matrix X. Refer to Fig. 3 for an example.

With our general input modeling scheme, FLOWERFORMER is readily applicable to different domains and neural architectures without additional modelings or steps. By contrast, state-of-the-art neural encoding methods often rely on complex modelings and/or preprocessing steps, such as the specialized treatment of specific operations [32] and isomorphic augmentations [50] (refer to Sec. 2.1). The empirical superiority of FLOWERFORMER (refer to Sec. 4) despite its straightforward (yet elegant) input modeling is attributed to our novel flow-aware GT architecture, which is described in the following subsection.
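For concreteness, the following is a minimal PyTorch sketch (ours, not the released implementation) of this input modeling; the operation vocabulary, the helper name encode_architecture, and the edge list approximating the cell in Fig. 3 are illustrative assumptions.

```python
import torch

# Illustrative operation vocabulary for a NAS-Bench-101-style cell (D = 5).
OPS = ["in", "1x1", "3x3", "mp", "out"]

def encode_architecture(node_ops, edges, ops=OPS):
    """Build the graph representation G = (A, X) of Sec. 3.2.

    node_ops: operation name of each node (length N), in node-index order.
    edges: directed edges (i, j) meaning v_i -> v_j, following the forward pass.
    Returns A in {0,1}^{N x N} and the one-hot feature matrix X in {0,1}^{N x D}.
    """
    n, d = len(node_ops), len(ops)
    A = torch.zeros(n, n)
    for i, j in edges:
        A[i, j] = 1.0                  # A_ij = 1 iff there is a directed edge v_i -> v_j
    X = torch.zeros(n, d)
    for v, op in enumerate(node_ops):
        X[v, ops.index(op)] = 1.0      # one row per node, one-hot over operation types
    return A, X

# A small cell roughly matching Fig. 3: in -> 3x3, 3x3 -> {mp, 1x1}, mp -> 1x1, 1x1 -> out.
A, X = encode_architecture(
    node_ops=["in", "3x3", "mp", "1x1", "out"],
    edges=[(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)],
)
```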
Algorithm 1: Flow encode module
Input: (1) G = (A, X): an input neural architecture; (2) H: an input node embedding matrix
Output: H: updated node embedding matrix
1  /* step 1. topological sorting */
2  T^G ← topological generations of G
3  /* step 2. asynchronous forward message passing */
4  for k = 1, ..., |T^G| do
5      for v_j ∈ T_k^G do
6          h_j ← Comb(h_j, Agg{m_e(h_j, h_i) : A_ij = 1})
7  /* step 3. asynchronous backward message passing */
8  for k = |T^G|, ..., 1 do
9      for v_j ∈ T_k^G do
10         h_j ← Comb(h_j, Agg{m_e(h_j, h_i) : A_ji = 1})
11 return H

3.3. FLOWER layers

In this section, we introduce FLOWER layers, the basic units of FLOWERFORMER. A FLOWER layer consists of two core components: the flow encode module and the flow-aware global attention module. The flow encode module is a message-passing neural network (MPNN) that asynchronously passes messages in the forward and then the backward order. The flow-aware global attention module is a self-attention module based on a flow-aware masking scheme. The outputs of the flow encode module and the flow-aware global attention module are node embedding matrices, denoted as H_flow^(l) ∈ R^(N×d) and H_global^(l) ∈ R^(N×d), respectively, for the l-th FLOWER layer. Below, we provide a detailed explanation of each module.

3.3.1 Flow encode module

As discussed in Sec. 3.1, we aim to enable a GT to capture the crucial aspect of neural architectures: information flows. To this end, the flow encode module conducts both asynchronous forward and backward message passing, resembling the forward pass (i.e., inference) and backpropagation (i.e., training) of neural architectures, respectively. These message-passing procedures are carried out in the (reversed) topological order of the input neural architecture graph, leading to updated node embeddings.

Pseudocode of the flow encode module is presented in Algorithm 1. It includes topological sorting, forward message passing, and backward message passing, in order, and each of these components is described below.

Topological sorting (Line 2): The first step is to divide the nodes (i.e., operations) into topological generations. Recall that neural-architecture graphs are directed acyclic graphs (DAGs). Given a DAG G, its first topological generation, denoted as T_1^G, comprises the nodes without incoming edges in G. Then, for each k > 1, the k-th topological generation T_k^G comprises the nodes without incoming edges when all preceding generations are removed from G. The set of non-empty topological generations is denoted as T^G := {T_1^G, ..., T_{|T^G|}^G}. Refer to Fig. 4 for an example.

Figure 4. Example topological generations. Nodes 1 and 2 are devoid of incoming edges, and thus they constitute the first topological generation T_1^G. Upon removal of nodes 1 and 2, nodes 3, 4, and 5 no longer have incoming edges, and thus they compose the second generation T_2^G. Subsequently, nodes 6 and 7 form the third and fourth generations, respectively.

These topological generations are closely related to the data flow within a neural architecture. For the operations (i.e., nodes) in each generation to be executed, all operations in the preceding generations need to be complete. Conversely, during the process of backpropagation, gradients flow from subsequent generations to preceding generations.

Forward message passing (Line 4-Line 6): During the forward message passing step, node embeddings are updated asynchronously, following the order of the topological generations, akin to the forward pass within neural architectures. For each node v_j, its embedding h_j (i.e., the j-th row vector of H) is updated by the following three steps: (1) computing the message m_e(h_j, h_i) for each incoming neighbor v_i, (2) aggregating these messages, and (3) combining the result with the current embedding h_j (Line 6). Note that the embeddings of all incoming neighbors, which belong to preceding generations, have already been updated by the time message calculation occurs. Also note that this differs from conventional synchronous graph message passing, where all node embeddings are updated simultaneously based on their input embeddings.

In our implementation, we use the sum aggregation as the Agg function. As m_e and Comb, we adopt the message function and the combine operator used in [43], as follows:

m_e(h_j, h_i) = \operatorname{softmax}(w_1^{\top} h_j + w_2^{\top} h_i)\, h_i,   (1)
\operatorname{msg}_j = \sum_{i: A_{ij}=1} m_e(h_j, h_i),   (2)
\operatorname{Comb}(h_j, \operatorname{msg}_j) = \operatorname{GRU}(h_j, \operatorname{msg}_j),   (3)

where w_1 ∈ R^d and w_2 ∈ R^d are learnable parameters.
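To make the asynchronous update concrete, below is a minimal PyTorch sketch (ours, not the official code) of the forward message-passing step of Algorithm 1, instantiating Eqs. (1)-(3). Two reading choices are assumptions: the softmax in Eq. (1) is taken as a normalization over the incoming neighbors of v_j, and the aggregated message is fed as the GRU input with h_j as the hidden state. The backward step is analogous, with edges reversed and generations visited in reverse order; this per-node loop is written for clarity, and the batched variant is discussed in Appendix B.2.

```python
import torch
import torch.nn as nn

class FlowEncoderSketch(nn.Module):
    """Asynchronous forward message passing over topological generations
    (step 2 of Algorithm 1)."""

    def __init__(self, d):
        super().__init__()
        self.w1 = nn.Linear(d, 1, bias=False)  # w_1 in Eq. (1)
        self.w2 = nn.Linear(d, 1, bias=False)  # w_2 in Eq. (1)
        self.gru = nn.GRUCell(d, d)            # Comb in Eq. (3)

    @staticmethod
    def topological_generations(A):
        """Split the nodes of the DAG with adjacency matrix A into generations."""
        remaining = set(range(A.size(0)))
        gens = []
        while remaining:
            gen = [v for v in remaining if all(A[u, v] == 0 for u in remaining)]
            gens.append(gen)
            remaining -= set(gen)
        return gens

    def forward(self, A, H):
        h = [H[v] for v in range(H.size(0))]   # per-node embeddings, updated generation by generation
        for gen in self.topological_generations(A):
            for j in gen:
                parents = (A[:, j] > 0).nonzero(as_tuple=True)[0].tolist()
                if not parents:
                    continue  # first-generation nodes keep their input embeddings
                p = torch.stack([h[i] for i in parents])     # parents were already updated
                # Eq. (1): attention-like weights over the incoming neighbors of v_j.
                scores = self.w1(h[j]) + self.w2(p).squeeze(-1)
                alpha = torch.softmax(scores, dim=0)
                # Eq. (2): sum aggregation of the weighted messages.
                msg = (alpha.unsqueeze(-1) * p).sum(dim=0)
                # Eq. (3): GRU-based combine of h_j and the aggregated message.
                h[j] = self.gru(msg.unsqueeze(0), h[j].unsqueeze(0)).squeeze(0)
        return torch.stack(h)
```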
Backward message passing (Line 8-Line 10): After the forward message passing step, we further update node embeddings through backward message passing, which resembles the process of backpropagation. This aligns with the standard practice in neural architecture training, where backpropagation typically occurs after the forward pass for loss computation.

During the backward message passing step, node embeddings are updated asynchronously, following the reverse order of the topological generations. For each node v_j, the messages from its outgoing neighbors (rather than incoming neighbors) are computed and then aggregated (Line 10). The other details remain consistent with those of the forward message passing in Eq. (1), Eq. (2), and Eq. (3).

Figure 5. Flow encode module. During forward message passing, node embeddings are updated following the order of topological generations. Conversely, during backward message passing, node embeddings are updated in the reverse order of the generations.

Outputs: We denote the output node-embedding matrix of the flow encode module in the ℓ-th FLOWER layer as H_flow^(ℓ) ∈ R^(N×d). That is,

H^{(\ell)}_{flow} = \operatorname{FlowEncoder}(G, H^{(\ell-1)}).   (4)

Here, H^(ℓ−1) ∈ R^(N×d) is the input node-embedding matrix obtained in layer (ℓ − 1), the previous layer (Eq. (6)).
3.3.2 Flow-aware global attention module

The flow-aware global attention module is designed to capture graph-level (i.e., architecture-level) characteristics, complementing the flow encode module, which primarily focuses on local-level flows between directly connected operations. To this end, we employ the global attention mechanism of GTs; moreover, to accurately reflect the flows within architectures, we restrict attention scores to be computed only between nodes connected by at least one path of the flows. Specifically, we employ a masking strategy [26, 49] with a mask matrix M ∈ R^(N×N) defined as follows (refer to Fig. 6 for an example of M):

M_{ij} = \begin{cases} 1 & \text{if } v_i \text{ lies on any directed path from } v_j \text{ or } v_j \text{ lies on any directed path from } v_i, \\ 0 & \text{otherwise.} \end{cases}

Figure 6. An example mask matrix M. Node 6 attends exclusively to nodes that appear in any path involving the node (P_1 and P_2). Nodes 1, 3, and 7 appear in P_1, and nodes 1, 4, and 7 appear in P_2; thus node 6 attends only to 1, 3, 4, and 7, as indicated by M.

Specifically, given the input node-embedding matrix H^(ℓ−1) ∈ R^(N×d) and the mask matrix M, the flow-aware global attention module computes its output node-embedding matrix H_global^(ℓ) ∈ R^(N×d) as follows:

H^{(\ell)}_{global} = \operatorname{MMHA}(H^{(\ell-1)}, H^{(\ell-1)}, H^{(\ell-1)}, M).   (5)

Here, MMHA is the Masked Multi-Head Attention module:

\operatorname{MMHA}(Q, K, V, M) = \operatorname{Concat}(\operatorname{head}_1, \dots, \operatorname{head}_s)\, W^0,

where W^0 ∈ R^(s d_v × d) is the learnable projection matrix, s is the number of heads, and

\operatorname{head}_i = \operatorname{Attn}(Q W_i^Q, K W_i^K, V W_i^V, M),
\operatorname{Attn}(Q, K, V, M) = \left( M \odot \operatorname{Softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) \right) V.

Here, ⊙ is element-wise multiplication; and W_i^Q ∈ R^(d×d_k), W_i^K ∈ R^(d×d_k), and W_i^V ∈ R^(d×d_v) denote the i-th head's learnable query, key, and value projection matrices, respectively. We adhere to the condition d_k = d_v = d/s for every head.
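A minimal sketch (ours, not the released code, and single-head for brevity) of how the mask M and the masked attention of Eq. (5) could be computed is shown below; reachability is obtained by a simple transitive closure over A, the diagonal of M is left at zero as in the example of Fig. 6, and a full MMHA would concatenate s such heads and project them with W^0.

```python
import torch
import torch.nn as nn

def flow_mask(A):
    """Mask of Sec. 3.3.2: M_ij = 1 iff v_i reaches v_j or v_j reaches v_i
    along the directed flow (the diagonal stays 0, as in Fig. 6)."""
    reach = (A > 0).float()
    for _ in range(A.size(0)):                     # transitive closure of the DAG
        reach = ((reach + reach @ reach) > 0).float()
    return ((reach + reach.t()) > 0).float()

class FlowAwareAttentionSketch(nn.Module):
    """One head of the masked attention in Eq. (5)."""

    def __init__(self, d, d_k):
        super().__init__()
        self.q = nn.Linear(d, d_k, bias=False)     # W^Q
        self.k = nn.Linear(d, d_k, bias=False)     # W^K
        self.v = nn.Linear(d, d_k, bias=False)     # W^V (d_v = d_k here)
        self.d_k = d_k

    def forward(self, H, M):
        Q, K, V = self.q(H), self.k(H), self.v(H)
        weights = torch.softmax(Q @ K.t() / self.d_k ** 0.5, dim=-1)
        return (M * weights) @ V                   # mask applied after the softmax, as in Attn(.)
```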
3.4. Overall framework: FLOWERFORMER

The overall framework of FLOWERFORMER is illustrated in Fig. 2. For each ℓ, it derives the output node embedding matrix H^(ℓ) from H_flow^(ℓ) (Eq. (4)) and H_global^(ℓ) (Eq. (5)) for the ℓ-th layer as follows:

H^{(\ell)} = \operatorname{FeedForward}(H^{(\ell)}_{flow} + H^{(\ell)}_{global}).   (6)

In our implementation, we employ a 2-layer MLP with ReLU activation [1] as the feedforward network. As shown in Fig. 2, note that we incorporate skip-connections and batch normalization in every module.

The output H^(ℓ) is used as the input of the next FLOWER layer, and for the first layer, we utilize a projected input feature matrix as the input by multiplying X with a learnable projection matrix P ∈ R^(D×d), i.e., H^(0) = XP. Each FLOWER layer has a separate set of learnable parameters.

The node embeddings in the output H^(L), where L represents the total number of FLOWER layers, are aggregated to derive the final embedding z_G of the input neural-architecture graph G as follows:

z_G = \operatorname{READOUT}(H^{(L)}).

For aggregation, we use mean pooling as the readout function in our implementation.

Application to performance prediction: The architecture embedding z_G is used for downstream tasks. For example, for performance prediction, it may serve as input to a regressor that outputs the estimated performance ŷ_G as follows:

\hat{y}_G = \operatorname{Regressor}(z_G).

In Sec. 4, we employ a fully connected layer as the regressor and utilize the following margin ranking loss for training both FLOWERFORMER and the regressor:

\mathcal{L} = \sum_{(i,j): y_i > y_j} \max(0, \text{margin} - (\hat{y}_i - \hat{y}_j)),   (7)

where y_i and y_j are the ground-truth performances of architectures G_i and G_j, respectively. For each pair of architectures G_i and G_j in the training set such that G_i outperforms G_j (i.e., y_i > y_j), the loss encourages ŷ_i to be greater than ŷ_j by at least a specified margin. Such designs for loss functions are commonly employed when it is important to make relative comparisons among instances (in our case, we compare neural architectures to recommend better ones).
4. Experiments

In this section, we review our experiments. For evaluation, we focus on the downstream task of predicting the performance of neural architectures. In Sec. 4.2, we compare the accuracies of FLOWERFORMER and six baseline methods, including two state-of-the-art methods (spec., TA-GATES [32] and NAR-Former [50]), using three performance prediction benchmark datasets composed of computer vision model architectures. In Sec. 4.3, we conduct an ablation study to validate each component of FLOWERFORMER. In Sec. 4.4, we extend our evaluation to datasets consisting of graph neural networks and auto speech recognition models. In Sec. 4.5, we examine the training and inference speed of FLOWERFORMER.

4.1. Experimental settings

Below, we provide an overview of our experimental setup.

4.1.1 Datasets

We evaluate the effectiveness of neural architecture encoding methods using five benchmark datasets designed for performance prediction, spanning three domains:
• Computer vision: We use three datasets: NAS-Bench-101 [51, 53], NAS-Bench-201 [10], and NAS-Bench-301 [54]. These datasets contain computer vision models.
• Speech recognition: We employ NAS-Bench-ASR [28], which consists of auto speech recognition architectures.
• Graph learning: We include NAS-Bench-Graph [36], which consists of graph neural networks.
Refer to Tab. 1 for basic statistics and to the supplementary material for details, including our preprocessing methods.

Table 1. Basic information about the benchmark datasets we used. The sizes of the training and test splits used in [32] are reported. Refer to Sec. 4.1.3 for details about training and test splits.

Dataset           Domain               # trains   # tests
NAS-Bench-101     Computer vision      7,290      7,290
NAS-Bench-201     Computer vision      7,813      7,812
NAS-Bench-301     Computer vision      5,896      51,072
NAS-Bench-ASR     Speech recognition   4,121      4,121
NAS-Bench-Graph   Graph learning       13,103     13,103

4.1.2 Baseline methods

We utilize six baseline approaches, categorized as follows: (a) Graph neural networks: GatedGCN [3] and directed acyclic graph neural network (DAGNN) [43]; (b) Graph transformers: GraphGPS [37] and DAGFormer [26]; and (c) Neural architecture encoders: TA-GATES [32] and NAR-Former [50], which are state-of-the-art methods for neural architecture performance prediction. We use the official implementations of these methods, and the links can be found in the supplementary material.

4.1.3 Training and evaluation protocol

For model training and evaluation, we follow the setting in [32], including their training and test splits. We use a subset of the training split as the actual training set, varying the size of this subset: 1%, 5%, 10%, and 50% of the training split. We use the first 40 architectures in the test split as a validation set for hyperparameter tuning and early stopping, and the remaining ones in the split as a test set. In each setting, we perform 9 trials using the three different splits and three different random seeds, and we report the mean and standard deviation across these trials. As accuracy metrics, we use Kendall's Tau [40] to assess overall performance and Precision@K (which measures the proportion of correctly predicted top-K architectures among the true top-K architectures) for the performance of identifying the best architectures. Note that these metrics are commonly employed in the field of neural architecture encoding [31, 32, 50].

4.2. Performance on computer vision benchmarks

In this subsection, we focus on the computer vision benchmarks, for which we have extensive baseline methods. In Tabs. 2 and 3, we report the performance prediction accuracies of the considered methods using two metrics across different training instance ratios. Notably, FLOWERFORMER consistently outperforms all baseline methods across all settings in terms of Kendall's Tau.
Table 2. Kendall's Tau (scaled up by a factor of 100; mean and standard deviation over 9 trials) on three datasets: NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301. In each setting, the best performances are highlighted in green. NA: there is no trivial extension of NAR-Former to NAS-Bench-301, which consists of two-cell architectures. Note that, in every setting, FLOWERFORMER performs best.
Datasets NAS-Bench-101 NAS-Bench-201 NAS-Bench-301 Avg.
Training portions 1% 5% 10% 50% 1% 5% 10% 50% 1% 5% 10% 50% Rank
GatedGCN [3] 67.4 (6.0) 79.6 (4.1) 82.0 (5.1) 84.8 (5.9) 70.9 (1.8) 84.1 (0.6) 88.6 (0.3) 92.3 (0.1) 61.8 (2.4) 70.0 (0.9) 71.4 (1.0) 72.7 (1.5) 4.91
DAGNN [43] 72.4 (4.5) 82.9 (3.1) 84.4 (4.4) 85.9 (5.3) 75.8 (1.0) 87.5 (0.8) 90.6 (0.2) 92.6 (0.0) 61.5 (1.9) 70.9 (0.5) 73.4 (1.2) 76.1 (1.3) 2.50
GraphGPS [37] 70.6 (4.4) 81.7 (3.8) 83.9 (4.2) 85.9 (5.1) 71.3 (1.3) 82.5 (0.6) 87.8 (0.5) 92.7 (0.1) 59.7 (1.8) 69.3 (0.9) 70.7 (1.2) 73.8 (0.7) 4.75
DAGFormer [26] 73.0 (4.3) 75.6 (5.2) 77.2 (7.0) 80.9 (5.9) 73.0 (73.0) 84.9 (0.8) 88.8 (0.5) 92.7 (0.1) 61.3 (2.0) 70.7 (0.8) 72.1 (0.8) 74.8 (1.0) 3.91
NAR-Former [50] 59.4 (8.8) 72.0 (8.2) 75.5 (10.2) 79.8 (5.9) 62.3 (4.0) 80.7 (1.8) 87.3 (0.7) 88.9 (0.3) NA NA NA NA -
TA-GATES [32] 70.8 (6.0) 82.3 (2.7) 83.9 (3.5) 86.3 (3.9) 77.7 (1.7) 86.3 (0.8) 88.7 (0.3) 91.4 (0.5) 61.3 (1.2) 68.9 (1.6) 71.8 (1.6) 75.4 (0.7) 3.83
FLOWERFORMER 75.0 (2.9) 86.1 (0.8) 88.1 (0.2) 89.6 (0.1) 80.0 (0.8) 89.8 (0.3) 91.3 (0.2) 92.9 (0.1) 64.2 (1.6) 72.2 (1.0) 73.6 (1.3) 77.5 (0.7) 1.00

Table 3. Precision@K (scaled up by a factor of 100; mean and standard deviation of 9 trials). The proportion of training samples is fixed to 5%. In each setting, the best performances are highlighted in green. NA: there is no trivial extension of NAR-Former to NAS-Bench-301, which consists of two-cell architectures. Note that, in most cases, FLOWERFORMER identifies the top-K architectures most accurately.
Datasets NAS-Bench-101 (5%) NAS-Bench-201 (5%) NAS-Bench-301 (5%) Avg.
K (for P@Top K%) 1 5 10 50 1 5 10 50 1 5 10 50 Rank
GatedGCN [3] 44.4 (7.6) 65.6 (3.5) 76.2 (2.7) 90.5 (2.1) 42.3 (3.7) 68.5 (3.1) 80.9 (1.9) 94.1 (0.6) 19.1 (4.1) 55.2 (4.5) 71.8 (2.9) 85.4 (0.4) 4.83
DAGNN [43] 41.7 (5.9) 65.4 (4.2) 79.3 (2.9) 92.0 (1.3) 49.6 (6.2) 69.7 (3.0) 83.1 (0.7) 95.3 (0.9) 23.1 (2.1) 58.3 (3.4) 73.1 (1.5) 85.8 (0.4) 2.75
GraphGPS [37] 44.3 (12.2) 67.1 (2.7) 78.7 (1.9) 91.2 (2.0) 49.4 (4.6) 67.9 (4.9) 78.9 (3.4) 93.4 (0.3) 20.6 (2.1) 57.2 (3.8) 73.4 (2.5) 84.8 (0.5) 4.17
DAGFormer [26] 39.4 (7.9) 61.8 (5.6) 71.6 (5.0) 88.2 (2.4) 50.7 (5.8) 70.4 (2.9) 82.5 (2.3) 94.2 (0.5) 20.7 (3.4) 57.6 (3.7) 73.4 (2.5) 85.6 (0.4) 3.83
NAR-Former [50] 47.2 (9.9) 62.6 (7.9) 67.8 (8.4) 85.9 (5.2) 49.5 (6.5) 64.7 (2.0) 69.9 (2.0) 92.3 (1.0) NA NA NA NA -
TA-GATES [32] 44.6 (9.7) 66.6 (4.0) 78.1 (4.6) 91.8 (1.2) 49.4 (3.1) 66.7 (3.3) 78.1 (2.8) 94.8 (0.7) 20.1 (5.0) 56.2 (6.1) 72.4 (3.6) 84.7 (0.7) 4.33
FLOWERFORMER 46.5 (11.2) 70.0 (1.5) 80.9 (1.8) 92.7 (1.7) 57.0 (5.4) 74.7 (1.8) 85.2 (1.3) 96.9 (0.7) 20.8 (3.7) 58.5 (2.5) 74.7 (2.4) 86.6 (0.6) 1.08

In terms of Precision@K, it performs best in 10 out of 12 settings, ranking second in the other settings. Two key observations are as follows.

The suboptimal performance of GraphGPS indicates that a graph transformer alone is insufficient to effectively represent neural architectures. Specifically, in terms of Kendall's Tau, the performance gap can be as large as 8.7 percentage points between FLOWERFORMER and GraphGPS. We thus argue that our incorporation of information flows into a graph transformer, through the introduction of the flow encode module and the flow-aware global attention module, plays a pivotal role in FLOWERFORMER's success.

The superiority of FLOWERFORMER over TA-GATES highlights the importance of the global attention mechanism. While TA-GATES may capture the information flow at a local level through its information propagation scheme, it does not adequately leverage the global context of neural architectures. FLOWERFORMER, on the other hand, uses the global attention mechanism to capture the graph-level (i.e., architecture-level) characteristics, empowering FLOWERFORMER to yield better representations of architectures.

In summary, our empirical findings substantiate that FLOWERFORMER serves as an effective predictor of neural architecture performance.

4.3. Ablation studies

In this subsection, we conduct ablation studies to validate the design choices made in FLOWERFORMER. Specifically, we aim to analyze the necessity of (a) asynchronous message passing, (b) forward-backward message passing, and (c) flow-aware global attention. To this end, we use four variants of FLOWERFORMER: (1) without the flow encode module (i.e., eliminating both asynchronous and forward-backward message passing), (2) without asynchronous message passing, (3) without forward-backward message passing, and (4) without flow-aware global attention.

Table 4. Comparison with four variants of FLOWERFORMER in terms of Kendall's Tau, using the same setups as in Tab. 2. In each setting, the best performances are highlighted in green. AS: Asynchronous message passing. FB: Forward-backward message passing. GA: Global attention. In most cases, FLOWERFORMER, which is equipped with all components, outperforms all of its variants, thereby validating the effectiveness of each component.

Dataset   #    AS  FB  GA   1%           5%           10%          50%
NB 101    (1)  ✗   ✗   ✔    41.5 (1.7)   42.5 (1.6)   41.1 (2.8)   43.1 (1.4)
          (2)  ✗   ✔   ✔    65.5 (8.8)   56.8 (5.0)   53.8 (7.4)   69.6 (11.6)
          (3)  ✔   ✗   ✔    76.7 (4.1)   83.9 (2.6)   84.6 (4.0)   85.6 (5.4)
          (4)  ✔   ✔   ✗    76.5 (3.0)   83.2 (3.9)   83.9 (5.1)   85.3 (6.3)
          -    ✔   ✔   ✔    75.0 (2.9)   86.1 (0.8)   88.1 (0.2)   89.6 (0.1)
NB 201    (1)  ✗   ✗   ✔    75.9 (1.2)   86.5 (0.2)   88.2 (0.3)   89.7 (0.1)
          (2)  ✗   ✔   ✔    73.7 (1.1)   85.6 (0.7)   89.2 (0.5)   92.9 (0.1)
          (3)  ✔   ✗   ✔    76.2 (2.1)   88.6 (0.8)   90.9 (0.1)   92.9 (0.1)
          (4)  ✔   ✔   ✗    79.3 (1.2)   89.5 (0.5)   91.1 (0.3)   92.9 (0.3)
          -    ✔   ✔   ✔    79.0 (0.8)   89.8 (0.3)   91.3 (0.2)   92.9 (0.1)
NB 301    (1)  ✗   ✗   ✔    63.3 (2.7)   69.0 (2.8)   68.1 (2.9)   59.8 (2.3)
          (3)  ✔   ✗   ✔    59.5 (2.6)   69.0 (1.8)   50.2 (15.0)  45.3 (19.4)
          (4)  ✔   ✔   ✗    60.9 (3.1)   69.8 (1.4)   70.9 (1.2)   67.7 (3.1)
          -    ✔   ✔   ✔    64.2 (1.6)   72.2 (1.0)   73.6 (1.3)   77.5 (0.7)

As shown in Tab. 4, FLOWERFORMER, which is equipped with all the components, consistently outperforms all variants in most settings, confirming the efficacy of our design choices. Further observations deserve attention. First, the necessity of asynchronous message passing for capturing flows is confirmed by the superior performance of (3) over (1), and that of FLOWERFORMER over (2).
Second, the advantage of forward-backward message passing is demonstrated by FLOWERFORMER's superiority over (3). Lastly, incorporating flow-awareness into global attention is advantageous, as evidenced by FLOWERFORMER's advantage over variant (4).

4.4. Performance in various domains

Since our input modeling does not require complex preprocessing, it can be readily applied to architectures across various domains. We apply FLOWERFORMER to graph neural networks on NAS-Bench-Graph and automatic speech recognition architectures on NAS-Bench-ASR. Among the baseline methods used in Sec. 4.2, we use the best method of each type: DAGNN, DAGFormer, and TA-GATES.

Table 5. Kendall's Tau (scaled up by a factor of 100; mean and standard deviation of 9 experiments) on two datasets beyond the computer vision domain: NAS-Bench-Graph (NB-G) [36] and NAS-Bench-ASR (NB-ASR) [28]. In each setting, the best performances are highlighted in green. In most cases, FLOWERFORMER performs best.

Dataset   Encoder           1%          5%          10%         50%
NB-G      DAGNN [43]        48.1 (3.2)  64.4 (1.2)  67.4 (1.1)  73.1 (0.8)
          DAGFormer [26]    47.9 (0.6)  60.8 (1.6)  64.9 (1.0)  72.4 (0.3)
          TA-GATES [32]     33.1 (1.4)  34.1 (2.0)  35.4 (0.8)  35.7 (0.5)
          FLOWERFORMER      49.5 (1.1)  65.9 (1.3)  68.9 (0.6)  72.7 (0.2)
NB-ASR    DAGNN [43]        29.5 (3.9)  40.9 (2.4)  45.2 (1.3)  44.0 (0.4)
          DAGFormer [26]    29.9 (5.4)  42.5 (1.1)  45.3 (1.0)  34.6 (5.8)
          TA-GATES [32]     34.0 (2.3)  41.4 (2.0)  44.9 (2.2)  50.9 (0.8)
          FLOWERFORMER      31.1 (8.0)  44.0 (0.9)  47.3 (1.3)  52.2 (1.4)

As shown in Tab. 5, FLOWERFORMER consistently performs best in most cases. These results indicate that FLOWERFORMER effectively captures important architectural characteristics across various domains. TA-GATES, which is tailored for encoding architectures in the computer vision domain, also shows strong performance in the domain of automatic speech recognition. TA-GATES effectively updates operation embeddings by multiplying operation embeddings and input information vectors, which is akin to convolutional mechanisms prevalent in auto speech recognition architectures. However, its effectiveness diminishes in scenarios where message passing between nodes, a key characteristic of graph neural networks, is required.

4.5. Training and inference speed

In this subsection, we compare the training and inference speeds of FLOWERFORMER and two state-of-the-art neural architecture encoding methods: NAR-Former and TA-GATES. To this end, we train all the models for 200 epochs with a batch size of 128, using an NVIDIA RTX 2080 GPU. We use NAS-Bench-101 with a training ratio of 1%. For a fair comparison, we exclude all additional time-consuming training strategies of NAR-Former and TA-GATES (e.g., input augmentation) in this experiment.

Table 6. Training and inference times on NAS-Bench-101 with a training ratio of 1%, 200 epochs, and a batch size of 128.

Encoder           Training time (sec)   Inference time (sec)   # Params
NAR-Former [50]   278.13 (13.01)        2.55 (0.07)            4,882,081
TA-GATES [32]     62.66 (0.54)          3.01 (0.21)            348,065
FLOWERFORMER      58.08 (1.39)          2.94 (0.07)            901,459

As shown in Tab. 6, FLOWERFORMER takes the shortest training time among the three methods. In particular, training FLOWERFORMER is 4.44× faster than training NAR-Former. This substantial speed advantage stems from the notable difference in model sizes, with NAR-Former having 5.35× the number of parameters compared to FLOWERFORMER. Compared to TA-GATES, FLOWERFORMER exhibits a slight speed advantage. Despite the small model size of TA-GATES, our specialized batch operations boost the training of FLOWERFORMER. Refer to the supplementary material for details of the batch operations.

In terms of inference time, there is not much difference among the three methods, and FLOWERFORMER ranks second. In practical scenarios, neural architecture performance prediction involves collecting labels (e.g., ground-truth performance) for the architectures in the training set, which requires time-consuming training of the architectures. For example, in the case of NAS-Bench-101, training just 1% of the architectures can take up to 24 GPU hours. Thus, inference speed is not a bottleneck, due to the extensive computational cost of training.

5. Conclusions

In this work, we propose FLOWERFORMER, a novel graph transformer model designed for neural architecture encoding. FLOWERFORMER excels at capturing information flows within neural architectures, considering both local and global aspects. Through comprehensive evaluations across five benchmarks for architecture performance prediction, FLOWERFORMER exhibits significant and consistent superiority over several state-of-the-art baseline methods, ultimately achieving state-of-the-art performance. Notably, FLOWERFORMER's superiority extends beyond computer-vision architectures, demonstrating its effectiveness for graph-learning and speech-recognition architectures.

Acknowledgements

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).
References

[1] Abien Fred Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018.
[2] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205, 2020.
[3] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.
[4] Michail Chatzianastasis, George Dasoulas, Georgios Siolas, and Michalis Vazirgiannis. Graph-based neural architecture search with operation embeddings. In ICCV, 2021.
[5] Yaofo Chen, Yong Guo, Qi Chen, Minli Li, Wei Zeng, Yaowei Wang, and Mingkui Tan. Contrastive neural architecture search with neural architecture comparators. In CVPR, 2021.
[6] Ziye Chen, Yibing Zhan, Baosheng Yu, Mingming Gong, and Bo Du. Not all operations contribute equally: Hierarchical operation-adaptive predictor for neural architecture search. In ICCV, 2021.
[7] Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, and Yue Qi. Graph propagation transformer for graph representation learning. In IJCAI, 2023.
[8] Hsin-Pai Cheng, Tunhou Zhang, Yixing Zhang, Shiyu Li, Feng Liang, Feng Yan, Meng Li, Vikas Chandra, Hai Li, and Yiran Chen. Nasgem: Neural architecture search via graph embedding method. In AAAI, 2021.
[9] Maxwell Crouse, Ibrahim Abdelaziz, Cristina Cornelio, Veronika Thost, Lingfei Wu, Kenneth Forbus, and Achille Fokoue. Improving graph neural network representations of logical formulae with subgraph pooling. arXiv preprint arXiv:1911.06904, 2019.
[10] Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. In ICLR, 2020.
[11] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020.
[12] C. Wei et al. Npenas: Neural predictor guided evolution for neural architecture search. IEEE Transactions on Neural Networks and Learning Systems, 2022.
[13] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. Darpa timit acoustic-phonetic continuous speech corpus cd-rom. NIST speech disc 1-1.1. NASA STI/Recon technical report n, 93:27403, 1993.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[16] Md Shamim Hussain, Mohammed J Zaki, and Dharmashankar Subramanian. Global self-attention as a replacement for graph convolution. In KDD, 2022.
[17] Yinghui Jiang, Shuting Jin, Xurui Jin, Xianglu Xiao, Wenfan Wu, Xiangrong Liu, Qiang Zhang, Xiangxiang Zeng, Guang Yang, and Zhangming Niu. Pharmacophoric-constrained heterogeneous graph transformer model for molecular property prediction. Communications Chemistry, 6(1):60, 2023.
[18] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[19] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. In NeurIPS, 2021.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[21] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[23] Shun Lu, Jixiang Li, Jianchao Tan, Sen Yang, and Ji Liu. Tnasp: A transformer-based nas predictor with a self-evolution framework. In NeurIPS, 2021.
[24] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In NeurIPS, 2018.
[25] Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. Semi-supervised neural architecture search. In NeurIPS, 2020.
[26] Yuankai Luo, Veronika Thost, and Lei Shi. Transformers over directed acyclic graphs. In NeurIPS, 2023.
[27] Liheng Ma, Chen Lin, Derek Lim, Adriana Romero-Soriano, Puneet K Dokania, Mark Coates, Philip Torr, and Ser-Nam Lim. Graph inductive biases in transformers without message passing. In ICML, 2023.
[28] Abhinav Mehrotra, Alberto Gil CP Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas Chau, Mohamed S Abdelfattah, Samin Ishtiaq, and Nicholas Donald Lane. Nas-bench-asr: Reproducible neural architecture search for speech recognition. In ICLR, 2020.
[29] Grégoire Mialon, Dexiong Chen, Margot Selosse, and Julien Mairal. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021.
[30] Hoang D Nguyen, Xuan-Son Vu, and Duc-Trong Le. Modular graph transformer networks for multi-label image classification. In AAAI, 2021.
[31] Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. A generic graph-based neural architecture encoding scheme for predictor-based nas. In ECCV, 2020.
[32] Xuefei Ning, Zixuan Zhou, Junbo Zhao, Tianchen Zhao, Yiping Deng, Changcheng Tang, Shuang Liang, Huazhong Yang, and Yu Wang. Ta-gates: An encoding scheme for neural network architectures. In NeurIPS, 2022.
[33] Peisong Niu, Tian Zhou, Qingsong Wen, Liang Sun, and Tao Yao. Chemistry guided molecular graph transformer. In NeurIPS 2022 Workshop: AI for Science: Progress and Promises, 2022.
[34] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019.
[35] Yunsheng Pang, Qiuhong Ke, Hossein Rahmani, James Bailey, and Jun Liu. Igformer: Interaction graph transformer for skeleton-based human interaction recognition. In ECCV, 2022.
[36] Yijian Qin, Ziwei Zhang, Xin Wang, Zeyang Zhang, and Wenwu Zhu. Nas-bench-graph: Benchmarking graph neural architecture search. In NeurIPS, 2022.
[37] Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. In NeurIPS, 2022.
[38] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In NeurIPS, 2020.
[39] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008.
[40] Pranab Kumar Sen. Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63(324):1379–1389, 1968.
[41] Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James Kwok, and Tong Zhang. Bridging the gap between sample-based and one-shot neural architecture search with bonas. In NeurIPS, 2020.
[42] Yu Shi, Shuxin Zheng, Guolin Ke, Yifei Shen, Jiacheng You, Jiyan He, Shengjie Luo, Chang Liu, Di He, and Tie-Yan Liu. Benchmarking graphormer on large-scale molecular modeling datasets. arXiv preprint arXiv:2203.04810, 2022.
[43] Veronika Thost and Jie Chen. Directed acyclic graph neural networks. In ICLR, 2021.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[45] Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, and Rodrigo Fonseca. Alphax: Exploring neural architectures with deep neural networks and monte carlo tree search. arXiv preprint arXiv:1903.11059, 2019.
[46] Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural architecture search. In ECCV, 2020.
[47] Colin White, Willie Neiswanger, and Yash Savani. Bananas: Bayesian optimization with neural architectures for neural architecture search. In AAAI, 2021.
[48] Colin White, Arber Zela, Robin Ru, Yang Liu, and Frank Hutter. How powerful are performance predictors in neural architecture search? In NeurIPS, 2021.
[49] Shen Yan, Kaiqiang Song, Fei Liu, and Mi Zhang. Cate: Computation-aware neural architecture encoding with transformers. In ICML, 2021.
[50] Yun Yi, Haokui Zhang, Wenze Hu, Nannan Wang, and Xiaoyu Wang. Nar-former: Neural architecture representation learning towards holistic attributes prediction. In CVPR, 2023.
[51] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards reproducible neural architecture search. In ICML, 2019.
[52] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In NeurIPS, 2021.
[53] Arber Zela, Julien Siems, and Frank Hutter. Nas-bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. In ICLR, 2019.
[54] Arber Zela, Julien Siems, Lucas Zimmer, Jovita Lukasik, Margret Keuper, and Frank Hutter. Surrogate nas benchmarks: Going beyond the limited search spaces of tabular nas benchmarks. In ICLR, 2022.
[55] Yi Zheng, Rushin H Gindra, Emily J Green, Eric J Burks, Margrit Betke, Jennifer E Beane, and Vijaya B Kolachalama. A graph-transformer for whole slide image classification. IEEE Transactions on Medical Imaging, 41(11):3003–3015, 2022.
A. Dataset description

In this section, we provide detailed descriptions of the datasets used.
• NAS-Bench-101 [51] is a dataset with 423K architectures trained on the CIFAR-10 [20] dataset. NAS-Bench-101 has an operation-on-node (OON) search space [31, 32]. Following Ning et al. [32], we used the same subset of the NAS-Bench-101 dataset. This subset consists of 14,580 architectures.
• NAS-Bench-201 [10] is a dataset with 15K architectures trained on the CIFAR-10 dataset. We transformed the dataset, which is originally operation-on-edge (OOE)-based, into the OON format.
• NAS-Bench-301 [54] is a surrogate benchmark with 57K architectures, each of which consists of two cells (spec., normal and reduction cells). This dataset is originally OOE-based, and we converted the dataset into the OON format. Following Ning et al. [32], we only used the anchor architecture-performance pairs.
• NAS-Bench-ASR [28] is a dataset with 8K architectures of auto speech recognition models, trained on the TIMIT audio dataset [13]. We transformed the dataset, which is originally OOE-based, into the OON format.
• NAS-Bench-Graph [36] is a dataset with 26K architectures of graph neural networks, trained on the Cora dataset [39]. Since the dataset is originally OON-based, no additional transformation is required.

B. Experimental Details

B.1. Implementation details

In this subsection, we provide several implementation details of FLOWERFORMER.

Code implementation: We employed the framework of GraphGPS [37] as the backbone to implement FLOWERFORMER with Python 3.10, PyTorch 1.13.1, and PyTorch Geometric 2.2.0.

Obtaining representations in two-cell datasets: To obtain representations in two-cell-based datasets (e.g., NAS-Bench-301), we used the following projection strategy. Let h_{1,o} ∈ R^d and h_{2,o} ∈ R^d denote the embeddings of the output nodes of cell 1 and cell 2 after forward message passing. Then, we concatenated h_{1,o} and h_{2,o} and projected the concatenated embedding with a learnable projection matrix W^P ∈ R^(2d×2d), as follows:

h' = \operatorname{concat}(h_{1,o}, h_{2,o})\, W^{\mathrm{P}}.   (8)

Then, we split h' ∈ R^(2d) into two and regarded each split as an embedding of the output nodes:

h_{1,o} = (h'_1, h'_2, \cdots, h'_d),   (9)
h_{2,o} = (h'_{d+1}, h'_{d+2}, \cdots, h'_{2d}),   (10)

where h'_i is the i-th entry of h'. Finally, we started asynchronous backward message passing (step 3 in Algorithm 1 of the main paper) with the updated h_{1,o} and h_{2,o}.

Training and hyperparameters: We used the AdamW optimizer [22] to train the models, and the best parameters were selected using early stopping. The hyperparameter search space was as follows:
• lr ∈ [10^−4, 10^−2]
• weight decay ∈ [10^−10, 10^−3]
• margin ∈ {0.01, 0.05, 0.1, 0.5, 1.0}
• L ∈ {4, 5, 6, 7, 8, 9, 10}
• d ∈ {64, 128, 256, 512}
• s ∈ {4, 8}
• dropout ∈ {0.1, 0.2, 0.3, 0.4, 0.5}
Further details regarding hyperparameters, including the best hyperparameter combination for each dataset, are available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.

B.2. Batch operation

Asynchronous message passing inevitably introduces some delay, since each operation should be performed in a sequential manner. In order to accelerate the computation, we employed group-based batch processing. Specifically, we used the topological batching strategy [9, 43], which is specialized in handling asynchronous operations. First, we grouped nodes that belong to the same topological generation and regarded each group as a single batch. Then, instead of updating the representation of a single node at a time, we updated the representations of nodes that belong to the same batch simultaneously. Note that this simultaneous update process ensures the same result as updating each node in a batch one by one, since the updating processes of nodes in the same topological generation are independent of each other. In this manner, for a one-way message passing, we performed |T^G| operations, which is generally smaller than the number of nodes.
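As a sketch of the batching described above (ours, not the released implementation; the toy sum_of_parents update merely stands in for Eqs. (1)-(3) of the main paper), each topological generation is processed with a single vectorized update, so a one-way pass costs |T^G| steps rather than N.

```python
import torch

def topological_generations(A):
    """Group the node indices of a DAG (adjacency matrix A) into topological generations."""
    remaining = set(range(A.size(0)))
    gens = []
    while remaining:
        gen = sorted(v for v in remaining if all(A[u, v] == 0 for u in remaining))
        gens.append(torch.tensor(gen))
        remaining -= set(gen)
    return gens

def forward_pass_batched(A, H, update_fn):
    """One asynchronous forward pass with one batched update per generation."""
    for gen in topological_generations(A):
        # Every node in `gen` depends only on nodes of earlier generations, whose
        # embeddings are already up to date, so the whole group is updated at once.
        H = update_fn(A, H, gen)
    return H

def sum_of_parents(A, H, gen):
    # Toy batched update: each node in the generation adds the sum of its
    # (already updated) parents' embeddings; a real update would apply Eqs. (1)-(3).
    H = H.clone()
    msg = A[:, gen].t() @ H          # (|gen| x N) @ (N x d) -> aggregated messages
    H[gen] = H[gen] + msg
    return H
```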
B.3. Baseline methods

• GatedGCN [3] and GraphGPS [37]: We used the GatedGCN implementation provided by the GraphGPS repository. For GraphGPS, we used GatedGCN and Performer as the MPNN and attention modules, respectively. We followed the choice used for OGBG-CODE2, which is the only dataset modeled as a DAG in [37]. The GitHub repository is https://github.com/rampasek/GraphGPS.
• DAGNN [43]: This model has a bidirectional option, and we considered whether to use it or not as a hyperparameter. The GitHub repository is https://github.com/vthost/DAGNN.
• DAGFormer [26]: DAGFormer introduces a framework that is applicable to existing graph transformers. We used the DAG+GraphGPS setting, which uses depth positional encoding and replaces the attention module of GraphGPS with reachability attention. The GitHub repository is https://github.com/LUOyk1999/DAGformer.
• NAR-Former [50]: We followed the augmentation technique and hyperparameter setting of NAR-Former used in [50] for each dataset. The GitHub repository is https://github.com/yuny220/NAR-Former.
• TA-GATES [32]: We followed the hyperparameter setting of TA-GATES used in [32] for each dataset. While the NAS-Bench-ASR dataset is OOE-based with multi-edges, the original TA-GATES implementation does not support multi-edges. Therefore, we converted the dataset into the OON format. The GitHub repository is https://github.com/walkerning/aw_nas.

C. Additional Experiments and Results

C.1. Neural architecture search experiments

To validate the practical utility of FLOWERFORMER, we conduct a series of Neural Architecture Search (NAS) experiments. We employ NPENAS [12] as the backbone search algorithm, using TA-GATES, DAGNN, NAR-Former, and FLOWERFORMER as performance predictors. We follow the experimental setup suggested in [12], with the modification of conducting 100 trials. The results in Figure 7 substantiate FLOWERFORMER's superior performance compared to baseline methods.

Figure 7. The average test error of the best neural architectures obtained by the NPENAS algorithm using different performance predictors on the NAS-Bench-101 dataset over 100 trials. (Plot: x-axis, number of test architectures; y-axis, test error of the best architecture (%); curves for FlowerFormer, TA-GATES, DAGNN, and NAR-Former.) The plot shows that FLOWERFORMER consistently outperforms other predictors in achieving lower test error rates, establishing its superiority in guiding the NAS process toward more accurate architectural choices.

C.2. Latency prediction experiments

To measure the encoding quality of FLOWERFORMER in various aspects and validate its effectiveness, we conduct a latency prediction experiment on NAS-Bench-201, comparing FLOWERFORMER with NAR-Former [50]. For this comparison, we utilize Mean Absolute Percentage Error (MAPE) and Error Bound Accuracy (Acc(δ)), the same metrics employed by Yi et al. [50] for latency prediction. As shown in Table 7, FLOWERFORMER outperforms NAR-Former in the latency prediction task.

Table 7. Mean Absolute Percentage Error (MAPE) and Error Bound Accuracy (ACC) at δ (scaled up by a factor of 100, mean over 9 trials) of latency prediction on the NAS-Bench-201 dataset. In each setting, the best performances are highlighted in green.

Metric           MAPE ↓        ACC (δ = 0.1%) ↑   ACC (δ = 1%) ↑   ACC (δ = 5%) ↑
Training ratio   5%     10%    5%      10%        5%      10%      5%      10%
NAR-Former       3.1    3.0    2.3     2.3        21.9    22.9     80.8    82.2
FLOWERFORMER     1.1    0.9    8.6     12.7       67.2    78.3     97.4    97.0

C.3. Evaluation with additional metrics

To evaluate the superior performance of FLOWERFORMER across different evaluation criteria, we examine performance on NAS-Bench-101 using the Pearson Coefficient of Linear Correlation (LC) and Root Mean Squared Error (RMSE). As shown in Table 8, FLOWERFORMER shows the best performance in all the settings.

Table 8. Linear Correlation (LC) and Root Mean Squared Error (RMSE) (mean over 9 trials) on the NAS-Bench-101 dataset. In each setting, the best performances are highlighted in green.

Metric           LC ↑                                RMSE ↓
Training ratio   1%      5%      10%     50%         1%      5%      10%     50%
DAGNN            0.4381  0.4919  0.5201  0.5876      0.0813  0.0802  0.0791  0.0755
TA-GATES         0.3303  0.3432  0.5087  0.5677      0.0834  0.0831  0.0803  0.0777
FLOWERFORMER     0.5636  0.6583  0.6605  0.7483      0.0768  0.0694  0.0670  0.0614

C.4. Extended evaluation on an additional dataset

In this section, we analyze the performance of FLOWERFORMER on ENAS, an additional dataset consisting of two cells. As shown in Table 9, FLOWERFORMER achieves the second-best performance on the dataset. We hypothesize that the sub-optimal performance of FLOWERFORMER stems from its failure to account for interactions between the two cells. Although there is information flow between the two cells, FLOWERFORMER lacks a dedicated global attention module that can capture their interactions. This limitation suggests that enhancing the global attention module to incorporate strategies like cross-attention could be a valuable future research direction.

Table 9. Kendall's Tau (scaled up by a factor of 100, mean over 9 trials) on the ENAS dataset. In each setting, the best performances are highlighted in green.

                   Training portions                      Avg.
Encoder            1%      5%      10%     50%            Rank
GatedGCN [3]       15.0    36.1    41.2    54.7           4.75
DAGNN [43]         31.0    47.0    52.6    61.3           1.25
GraphGPS [37]      6.9     26.5    34.2    51.2           6.00
DAGFormer [26]     12.2    41.4    46.5    57.9           4.25
TA-GATES [32]      22.9    45.2    49.4    61.2           2.50
FLOWERFORMER       18.8    44.3    49.5    64.7           2.25