FlowerFormer: Empowering Neural Architecture Encoding Using a Flow-Aware Graph Transformer
Figure 2. Overview of the proposed FLOWERFORMER, which contains two key modules in each of its layers: the flow encode module and the flow-aware global attention module. The flow encode module performs bidirectional asynchronous message passing, inspired by forward and backward passes, to produce a node embedding matrix H_flow. The flow-aware global attention module computes attention with a flow-based masking scheme to yield another node embedding matrix H_global. These two embedding matrices, H_flow and H_global, are combined and then projected to produce updated node embeddings at each layer. This process is iterated over L layers, and the output node embeddings are aggregated to form the final architecture embedding, which is fed into a regressor for performance prediction.
3. Proposed method: FLOWERFORMER

In this section, we present FLOWERFORMER (Flow-aware graph transformer), a graph transformer model designed to capture information flows within an input neural architecture. First, we provide the motivation behind FLOWERFORMER in Sec. 3.1. Then, we describe how an input neural architecture is represented as a graph in Sec. 3.2. After that, we elaborate on how FLOWERFORMER learns the representation of the neural-architecture graph. Specifically, we describe the two core modules of FLOWERFORMER, collectively referred to as FLOWER, in Sec. 3.3. Lastly, we present the overall framework (refer to Fig. 2) in Sec. 3.4.

Figure 3. An example neural architecture from the NAS-Bench-101 dataset, represented as a directed acyclic graph (DAG), together with its adjacency matrix A and node feature matrix X. Each column of X corresponds to a specific operation (in, 1x1, 3x3, mp, out), and each row of X is a one-hot vector indicating the type of operation associated with the corresponding node.

3.1. Motivation of capturing information flows

Despite the remarkable success of Graph Transformers (GTs) in various graph-level tasks, including graph classification [16, 52] and regression [7, 27], their application to encoding neural architectures has received relatively limited attention. Existing applications of GTs suggest that additional design choices for accurately capturing the underlying characteristics of input graphs (on top of the global attention mechanism between all pairs of nodes) are essential for the effectiveness of GTs. Refer to Sec. 2.2 for some examples.

In this work, we focus on a crucial aspect: capturing information flows within neural architectures (i.e., input graphs). Information flows include both the forward pass of data and the backpropagation of gradients. Hence, capturing information flows is essential for incorporating how neural architectures are trained and conduct inference into their embeddings (i.e., the encoded neural architectures).

3.2. Input modeling

We represent a given neural architecture as a directed acyclic graph (DAG), with each node representing an operation (e.g., pooling or convolution). Each directional edge between two nodes indicates the information flow between the corresponding operations, aligning with the direction of data propagation during the forward pass. An illustrative example can be found on the left-hand side of Figure 3.

We denote the graph representation of a neural architecture by G = (A, X), a tuple of an adjacency matrix A ∈ {0, 1}^{N×N} and a node (i.e., operation) feature matrix X ∈ {0, 1}^{N×D}, where N is the number of nodes and D is the number of operations. The adjacency matrix encodes direct connections between node pairs in the graph: its (i, j)-th entry is set to 1 if there is a directed edge from the i-th node (denoted as v_i) to the j-th node (denoted as v_j), and 0 otherwise. Each node is associated with a one-hot feature vector representing its corresponding operation, and these vectors are stacked vertically to form the node feature matrix X. Refer to Fig. 3 for an example.

With our general input modeling scheme, FLOWERFORMER is readily applicable to different domains and neural architectures without additional modeling or preprocessing steps. By contrast, state-of-the-art neural architecture encoding methods often rely on complex modeling and/or preprocessing steps, such as the specialized treatment of specific operations [32] and isomorphic augmentations [50] (refer to Sec. 2.1). The empirical superiority of FLOWERFORMER (refer to Sec. 4), despite its straightforward (yet elegant) input modeling, is attributed to our novel flow-aware GT architecture, which is described in the following subsection.
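To make the input modeling concrete, the following sketch builds the pair (A, X) for the small cell shown in Fig. 3, as far as it can be read off the figure. The operation vocabulary order and the variable names are our own illustrative choices, not taken from the paper's code.

```python
import numpy as np

# Illustrative operation vocabulary (its order defines the one-hot columns of X).
OPS = ["in", "1x1", "3x3", "mp", "out"]  # D = 5 operations

# The DAG of Fig. 3: node index -> operation, plus directed edges (i, j)
# meaning "the output of node i feeds node j".
node_ops = ["in", "3x3", "mp", "1x1", "out"]      # N = 5 nodes
edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)]

N, D = len(node_ops), len(OPS)

# Adjacency matrix A: A[i, j] = 1 iff there is a directed edge v_i -> v_j.
A = np.zeros((N, N), dtype=np.int64)
for i, j in edges:
    A[i, j] = 1

# Node feature matrix X: row j is the one-hot encoding of node j's operation.
X = np.zeros((N, D), dtype=np.int64)
for j, op in enumerate(node_ops):
    X[j, OPS.index(op)] = 1

print(A)
print(X)
```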
Algorithm 1: Flow encode module
Input: (1) G = (A, X): an input neural architecture; (2) H: an input node embedding matrix
Output: H: updated node embedding matrix
1:  /* step 1: topological sorting */
2:  T^G ← topological generations of G
3:  /* step 2: asynchronous forward message passing */
4:  for k = 1, ..., |T^G| do
5:      for v_j ∈ T_k^G do
6:          h_j ← Comb(h_j, Agg{m_e(h_j, h_i) : A_ij = 1})
7:  /* step 3: asynchronous backward message passing */
8:  for k = |T^G|, ..., 1 do
9:      for v_j ∈ T_k^G do
10:         h_j ← Comb(h_j, Agg{m_e(h_j, h_i) : A_ji = 1})
11: return H

Figure 4. Example topological generations. Nodes 1 and 2 are devoid of incoming edges, and thus they constitute the first topological generation T_1^G = {1, 2}. Upon removal of nodes 1 and 2, nodes 3, 4, and 5 no longer have incoming edges, and thus they compose the second generation T_2^G = {3, 4, 5}. Subsequently, nodes 6 and 7 form the third and fourth generations, T_3^G = {6} and T_4^G = {7}, respectively.

3.3. FLOWER layers

In this section, we introduce FLOWER layers, the basic units of FLOWERFORMER. A FLOWER layer consists of two core components: the flow encode module and the flow-aware global attention module. The flow encode module is a message-passing neural network (MPNN) that asynchronously passes messages in the forward and then the backward order. The flow-aware global attention module is a self-attention module based on a flow-aware masking scheme. The outputs of the flow encode module and the flow-aware global attention module are node embedding matrices, denoted as H_flow^(ℓ) ∈ R^{N×d} and H_global^(ℓ) ∈ R^{N×d}, respectively, for the ℓ-th FLOWER layer. Below, we provide a detailed explanation of each module.

3.3.1 Flow encode module

As discussed in Sec. 3.1, we aim to enable a GT to capture the crucial aspect of neural architectures: information flows. To this end, the flow encode module conducts both asynchronous forward and backward message passing, resembling the forward pass (i.e., inference) and backpropagation (i.e., training) of neural architectures, respectively. These message-passing procedures are carried out in the (reversed) topological order of the input neural-architecture graph, leading to updated node embeddings.

Pseudocode of the flow encode module is presented in Algorithm 1. It consists of topological sorting, forward message passing, and backward message passing, in order, and each of these components is described below.

Topological sorting (Line 2): The first step is to divide the nodes (i.e., operations) into topological generations. Recall that neural-architecture graphs are directed acyclic graphs (DAGs). Given a DAG G, its first topological generation, denoted as T_1^G, comprises the nodes without incoming edges in G. Then, for each k > 1, the k-th topological generation T_k^G comprises the nodes without incoming edges when all preceding generations are removed from G. The set of non-empty topological generations is denoted as T^G := {T_1^G, ..., T_{|T^G|}^G}. Refer to Fig. 4 for an example.

These topological generations are closely related to the data flow within a neural architecture. For the operations (i.e., nodes) in each generation to be executed, all operations in the preceding generations need to be complete. Conversely, during the process of backpropagation, gradients flow from subsequent generations to preceding generations.

Forward message passing (Lines 4-6): During the forward message passing step, node embeddings are updated asynchronously, following the order of the topological generations, akin to the forward pass within neural architectures. For each node v_j, its embedding h_j (i.e., the j-th row vector of H) is updated by the following three steps: (1) computing the message m_e(h_j, h_i) for each incoming neighbor v_i, (2) aggregating these messages, and (3) combining the result with the current embedding h_j (Line 6). Note that the embeddings of all incoming neighbors, which belong to preceding generations, have already been updated by the time message calculation occurs. Also note that this differs from conventional synchronous graph message passing, where all node embeddings are updated simultaneously based on their input embeddings.

In our implementation, we use sum aggregation as the Agg function. As m_e and Comb, we adopt the message function and the combine operator used in [43], as follows:

m_e(h_j, h_i) = \operatorname{softmax}(w_1^{\top} h_j + w_2^{\top} h_i)\, h_i,   (1)
\operatorname{msg}_j = \sum_{i:\, A_{ij}=1} m_e(h_j, h_i),   (2)
\operatorname{Comb}(h_j, \operatorname{msg}_j) = \operatorname{GRU}(h_j, \operatorname{msg}_j),   (3)

where w_1 ∈ R^d and w_2 ∈ R^d are learnable parameters.
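To illustrate Lines 2-6 of Algorithm 1, the sketch below computes topological generations with NetworkX and runs the asynchronous forward pass with the message function of Eq. (1), sum aggregation (Eq. (2)), and a GRU cell as the combine operator (Eq. (3)). It is a minimal single-graph PyTorch sketch, not the paper's batched implementation; the class name, tensor shapes, and hyperparameters are our own choices.

```python
import networkx as nx
import torch
import torch.nn as nn

class FlowEncodeForward(nn.Module):
    """Sketch of steps 1-2 of Algorithm 1 (forward direction only), following Eqs. (1)-(3)."""
    def __init__(self, d: int):
        super().__init__()
        self.w1 = nn.Linear(d, 1, bias=False)   # w_1^T h_j
        self.w2 = nn.Linear(d, 1, bias=False)   # w_2^T h_i
        self.gru = nn.GRUCell(d, d)             # combine operator of Eq. (3)

    def forward(self, A: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        N = A.shape[0]
        h = [H[j] for j in range(N)]            # per-node embeddings, updated generation by generation
        # Step 1: topological generations of the DAG (cf. Fig. 4).
        generations = nx.topological_generations(nx.DiGraph(A.numpy()))
        # Step 2: asynchronous updates; incoming neighbors are already up to date.
        for generation in generations:
            for j in generation:
                preds = [i for i in range(N) if A[i, j] == 1]
                if not preds:
                    continue
                h_in = torch.stack([h[i] for i in preds])         # embeddings of incoming neighbors
                scores = self.w1(h[j]) + self.w2(h_in)            # (num_preds, 1)
                alpha = torch.softmax(scores, dim=0)              # Eq. (1): softmax over incoming edges
                msg = (alpha * h_in).sum(dim=0)                   # Eq. (2): sum aggregation
                h[j] = self.gru(msg.unsqueeze(0), h[j].unsqueeze(0)).squeeze(0)  # Eq. (3)
        return torch.stack(h)

# Toy usage with the 5-node adjacency matrix from the earlier sketch.
A = torch.tensor([[0, 1, 0, 0, 0],
                  [0, 0, 1, 1, 0],
                  [0, 0, 0, 1, 0],
                  [0, 0, 0, 0, 1],
                  [0, 0, 0, 0, 0]])
H_new = FlowEncodeForward(d=16)(A, torch.randn(5, 16))
# The backward pass (Lines 8-10 of Algorithm 1) is symmetric: iterate the generations
# in reverse and gather messages from outgoing neighbors (A[j, i] = 1).
```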
Figure 5. Flow encode module. During forward message passing, node embeddings are updated following the order of topological generations. Conversely, during backward message passing, node embeddings are updated in the reverse order of the generations.

Backward message passing (Lines 8-10): After the forward message passing step, we further update node embeddings through backward message passing, which resembles the process of backpropagation. This aligns with the standard practice in neural architecture training, where backpropagation typically occurs after the forward pass for loss computation.

During the backward message passing step, node embeddings are updated asynchronously, following the reverse order of the topological generations. For each node v_j, the messages from its outgoing neighbors (rather than its incoming neighbors) are computed and then aggregated (Line 10). The other details remain consistent with those of the forward message passing in Eq. (1), Eq. (2), and Eq. (3).

Outputs: We denote the output node-embedding matrix of the flow encode module in the ℓ-th FLOWER layer as H_flow^(ℓ) ∈ R^{N×d}. That is,

H^{(\ell)}_{flow} = \operatorname{FlowEncoder}(G, H^{(\ell-1)}).   (4)

Here, H^{(ℓ-1)} ∈ R^{N×d} is the input node-embedding matrix obtained in layer (ℓ-1), the previous layer (Eq. (6)).

3.3.2 Flow-aware global attention module

The flow-aware global attention module is designed to capture graph-level (i.e., architecture-level) characteristics, complementing the flow encode module, which primarily focuses on local-level flows between directly connected operations. To this end, we employ the global attention mechanism of GTs; moreover, to accurately reflect the flows within architectures, we restrict attention scores to be computed only between nodes connected by at least one path of the flows. Specifically, we employ a masking strategy [26, 49] with a mask matrix M ∈ R^{N×N} defined as follows (refer to Fig. 6 for an example of M):

M_{ij} = \begin{cases} 1 & \text{if } v_i \text{ lies on any directed path from } v_j \text{ or } v_j \text{ lies on any directed path from } v_i, \\ 0 & \text{otherwise.} \end{cases}

Figure 6. An example mask matrix M (1 = attend, 0 = do not attend). Node 6 attends exclusively to nodes that appear in some path involving it (P_1 and P_2). Nodes 1, 3, and 7 appear in P_1, and nodes 1, 4, and 7 appear in P_2; thus node 6 attends only to nodes 1, 3, 4, and 7, as indicated by M.

Specifically, given the input node-embedding matrix H^{(ℓ-1)} ∈ R^{N×d} and the mask matrix M, the flow-aware global attention module computes its output node-embedding matrix H_global^(ℓ) ∈ R^{N×d} as follows:

H^{(\ell)}_{global} = \operatorname{MMHA}(H^{(\ell-1)}, H^{(\ell-1)}, H^{(\ell-1)}, M).   (5)

Here, MMHA is the Masked Multi-Head Attention module:

\operatorname{MMHA}(Q, K, V, M) = \operatorname{Concat}(\operatorname{head}_1, \dots, \operatorname{head}_s)\, W^0,

where W^0 ∈ R^{s d_v × d} is a learnable projection matrix, s is the number of heads, and

\operatorname{head}_i = \operatorname{Attn}(Q W_i^Q, K W_i^K, V W_i^V, M),
\operatorname{Attn}(Q, K, V, M) = \left( M \odot \operatorname{Softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) \right) V.

Here, ⊙ denotes element-wise multiplication, and W_i^Q ∈ R^{d×d_k}, W_i^K ∈ R^{d×d_k}, and W_i^V ∈ R^{d×d_v} denote the i-th head's learnable query, key, and value projection matrices, respectively. We adhere to the condition d_k = d_v = d/s for every head.
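The sketch below shows one way to realize the flow-aware mask and Eq. (5): M marks node pairs connected by a directed path in either direction (computed here via transitive closure), and the mask is applied multiplicatively after the softmax, as in the Attn definition above. This is a single-graph, single-layer illustration with hypothetical dimensions; letting each node attend to itself (reflexive closure) is an implementation choice of this sketch, and the per-head projections are folded into one d-by-d linear map per role, which is the standard equivalent trick.

```python
import networkx as nx
import torch
import torch.nn as nn

def flow_mask(A: torch.Tensor) -> torch.Tensor:
    """M[i, j] = 1 iff v_i reaches v_j or v_j reaches v_i via a directed path."""
    reach = nx.transitive_closure(nx.DiGraph(A.numpy()), reflexive=True)
    N = A.shape[0]
    M = torch.zeros(N, N)
    for i, j in reach.edges():
        M[i, j] = 1.0
        M[j, i] = 1.0          # either direction counts
    return M

class FlowAwareGlobalAttention(nn.Module):
    """Masked multi-head self-attention (MMHA) with the post-softmax mask of Eq. (5)."""
    def __init__(self, d: int, s: int):
        super().__init__()
        assert d % s == 0
        self.s, self.dk = s, d // s                      # d_k = d_v = d / s
        self.q = nn.Linear(d, d, bias=False)
        self.k = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, d, bias=False)
        self.out = nn.Linear(d, d, bias=False)           # output projection W^0

    def forward(self, H: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
        N, _ = H.shape
        def split(t):                                    # (N, d) -> (s, N, d_k)
            return t.view(N, self.s, self.dk).transpose(0, 1)
        Q, K, V = split(self.q(H)), split(self.k(H)), split(self.v(H))
        scores = Q @ K.transpose(-2, -1) / self.dk ** 0.5        # (s, N, N)
        attn = M.unsqueeze(0) * torch.softmax(scores, dim=-1)    # mask applied after the softmax
        heads = attn @ V                                         # (s, N, d_k)
        return self.out(heads.transpose(0, 1).reshape(N, -1))    # concatenate heads, project

# Usage with the earlier 5-node example: H_global = FlowAwareGlobalAttention(16, 4)(H_new, flow_mask(A))
```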
3.4. Overall framework: FLOWERFORMER

The overall framework of FLOWERFORMER is illustrated in Fig. 2. For each ℓ, it derives the output node-embedding matrix H^(ℓ) from H_flow^(ℓ) (Eq. (4)) and H_global^(ℓ) (Eq. (5)) for the ℓ-th layer as follows:

H^{(\ell)} = \operatorname{FeedForward}(H^{(\ell)}_{flow} + H^{(\ell)}_{global}).   (6)

In our implementation, we employ a 2-layer MLP with ReLU activation [1] as the feedforward network. As shown in Fig. 2, note that we incorporate skip connections and batch normalization in every module.

The output H^(ℓ) is used as the input of the next FLOWER layer. For the first layer, we use a projected input feature matrix obtained by multiplying X with a learnable projection matrix P ∈ R^{D×d}, i.e., H^(0) = XP. Each FLOWER layer has a separate set of learnable parameters.

The node embeddings in the output H^(L), where L denotes the total number of FLOWER layers, are aggregated to derive the final embedding z_G of the input neural-architecture graph G as follows:

z_G = \operatorname{READOUT}(H^{(L)}).

For aggregation, we use mean pooling as the readout function in our implementation.
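The sketch below assembles the pieces in the spirit of Eq. (6), reusing the FlowEncodeForward and FlowAwareGlobalAttention sketches from above (assumed to be in scope). The exact placement of the skip connection and batch normalization is our guess from Fig. 2, the flow encoder here is forward-only, and the layer widths are illustrative, so this is a sketch under stated assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class FlowerLayer(nn.Module):
    """One FLOWER layer in the spirit of Eq. (6): FeedForward(H_flow + H_global)."""
    def __init__(self, d: int, s: int = 4):
        super().__init__()
        self.flow = FlowEncodeForward(d)              # stand-in for the bidirectional flow encoder
        self.attn = FlowAwareGlobalAttention(d, s)
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # 2-layer MLP
        self.bn = nn.BatchNorm1d(d)

    def forward(self, A, M, H):
        h = self.flow(A, H) + self.attn(H, M)         # combine the two module outputs
        return self.bn(H + self.ff(h))                # feedforward with skip connection and batch norm

class FlowerFormerSketch(nn.Module):
    """Stack of L FLOWER layers, mean-pooling readout, and a linear regressor."""
    def __init__(self, num_ops: int, d: int = 64, L: int = 4):
        super().__init__()
        self.proj = nn.Linear(num_ops, d, bias=False)         # H^(0) = X P
        self.layers = nn.ModuleList(FlowerLayer(d) for _ in range(L))
        self.regressor = nn.Linear(d, 1)                      # fully connected regressor

    def forward(self, A, M, X):
        H = self.proj(X)
        for layer in self.layers:
            H = layer(A, M, H)
        z = H.mean(dim=0)                                     # READOUT: mean pooling over nodes
        return self.regressor(z).squeeze(-1)                  # estimated performance of the architecture

# Example (with A, X from the input-modeling sketch and flow_mask from the attention sketch):
# y_hat = FlowerFormerSketch(num_ops=5)(A, flow_mask(A), torch.tensor(X, dtype=torch.float32))
```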
Application to performance prediction: The architecture embedding z_G is used for downstream tasks. For example, for performance prediction, it may serve as input to a regressor that outputs the estimated performance ŷ_G as follows:

\hat{y}_G = \operatorname{Regressor}(z_G).

In Sec. 4, we employ a fully connected layer as the regressor and utilize the following margin ranking loss for training both FLOWERFORMER and the regressor:

\mathcal{L} = \sum_{(i,j):\, y_i > y_j} \max\big(0, \text{margin} - (\hat{y}_i - \hat{y}_j)\big),   (7)

where y_i and y_j are the ground-truth performances of architectures G_i and G_j, respectively. For each pair of architectures G_i and G_j in the training set such that G_i outperforms G_j (i.e., y_i > y_j), the loss encourages ŷ_i to be greater than ŷ_j by at least the specified margin. Such loss designs are commonly employed when relative comparisons among instances matter (in our case, we compare neural architectures to recommend better ones).
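A direct way to compute Eq. (7) is sketched below: a dense pairwise hinge over all ordered pairs with y_i > y_j. The dense O(n^2) formulation is for illustration; practical training loops would typically form pairs per mini-batch, which is an assumption on our part rather than a detail given in the text.

```python
import torch

def margin_ranking_loss(y_hat: torch.Tensor, y: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Eq. (7): sum of max(0, margin - (y_hat_i - y_hat_j)) over all pairs with y_i > y_j."""
    diff_pred = y_hat.unsqueeze(1) - y_hat.unsqueeze(0)   # (n, n) matrix of y_hat_i - y_hat_j
    better = (y.unsqueeze(1) > y.unsqueeze(0)).float()    # (n, n) indicator of y_i > y_j
    return (better * torch.clamp(margin - diff_pred, min=0)).sum()

# Example: the loss is zero once every better architecture is scored at least `margin` higher.
y_true = torch.tensor([0.92, 0.88, 0.95])
y_pred = torch.tensor([0.60, 0.10, 0.90])
print(margin_ranking_loss(y_pred, y_true, margin=0.1))
```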
4. Experiments

In this section, we review our experiments. For evaluation, we focus on the downstream task of predicting the performance of neural architectures. In Sec. 4.2, we compare the accuracy of FLOWERFORMER and six baseline methods, including two state-of-the-art methods (spec., TA-GATES [32] and NAR-Former [50]), using three performance prediction benchmark datasets composed of computer vision model architectures. In Sec. 4.3, we conduct an ablation study to validate each component of FLOWERFORMER. In Sec. 4.4, we extend our evaluation to datasets consisting of graph neural networks and automatic speech recognition models. In Sec. 4.5, we examine the training and inference speed of FLOWERFORMER.

4.1. Experimental settings

Below, we provide an overview of our experimental setup.

4.1.1 Datasets

We evaluate the effectiveness of neural architecture encoding methods using five benchmark datasets designed for performance prediction, spanning three domains:
• Computer vision: We use three datasets: NAS-Bench-101 [51, 53], NAS-Bench-201 [10], and NAS-Bench-301 [54]. These datasets contain computer vision models.
• Speech recognition: We employ NAS-Bench-ASR [28], which consists of automatic speech recognition architectures.
• Graph learning: We include NAS-Bench-Graph [36], which consists of graph neural networks.
Refer to Tab. 1 for basic statistics and to the supplementary material for details, including our preprocessing methods.

Table 1. Basic information about the benchmark datasets we used. The sizes of the training and test splits used in [32] are reported. Refer to Sec. 4.1.3 for details about training and test splits.

Dataset | Domain | # train | # test
NAS-Bench-101 | Computer vision | 7,290 | 7,290
NAS-Bench-201 | Computer vision | 7,813 | 7,812
NAS-Bench-301 | Computer vision | 5,896 | 51,072
NAS-Bench-ASR | Speech recognition | 4,121 | 4,121
NAS-Bench-Graph | Graph learning | 13,103 | 13,103

4.1.2 Baseline methods

We utilize six baseline approaches, categorized as follows: (a) graph neural networks: GatedGCN [3] and the directed acyclic graph neural network (DAGNN) [43]; (b) graph transformers: GraphGPS [37] and DAGFormer [26]; and (c) neural architecture encoders: TA-GATES [32] and NAR-Former [50], which are state-of-the-art methods for neural architecture performance prediction. We use the official implementations of these methods, and the links can be found in the supplementary material.

4.1.3 Training and evaluation protocol

For model training and evaluation, we follow the setting in [32], including their training and test splits. We use a subset of the training split as the actual training set, varying the size of this subset: 1%, 5%, 10%, and 50% of the training split. We use the first 40 architectures in the test split as a validation set for hyperparameter tuning and early stopping, and the remaining ones in the split as the test set. In each setting, we perform 9 trials using the three different splits and three different random seeds, and we report the mean and standard deviation across these trials. As accuracy metrics, we use Kendall's Tau [40] to assess overall performance and Precision@K (which measures the proportion of correctly predicted top-K architectures among the true top-K architectures) to assess the ability to identify the best architectures. Note that these metrics are commonly employed in the field of neural architecture encoding [31, 32, 50].
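For concreteness, the two evaluation metrics described above can be computed as in the sketch below: Kendall's Tau via SciPy, and Precision@K over the top K% of architectures (matching the "P@Top K%" convention used in Tab. 3). The function names and the random example data are ours.

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_tau(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Rank correlation between ground-truth and predicted performance."""
    tau, _p = kendalltau(y_true, y_pred)
    return tau

def precision_at_k(y_true: np.ndarray, y_pred: np.ndarray, k_percent: float) -> float:
    """Fraction of the true top-K% architectures that also appear in the predicted top-K%."""
    k = max(1, int(len(y_true) * k_percent / 100))
    true_top = set(np.argsort(-y_true)[:k])
    pred_top = set(np.argsort(-y_pred)[:k])
    return len(true_top & pred_top) / k

# Example with random scores (values are illustrative only).
rng = np.random.default_rng(0)
y_true, y_pred = rng.random(1000), rng.random(1000)
print(kendall_tau(y_true, y_pred), precision_at_k(y_true, y_pred, k_percent=5))
```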
4.2. Performance on computer vision benchmarks

In this subsection, we focus on the computer vision benchmarks, for which we have the most extensive set of baseline methods. In Tabs. 2 and 3, we report the performance prediction accuracies of the considered methods using two metrics across different training-instance ratios. Notably, FLOWERFORMER consistently outperforms all baseline methods across all settings in terms of Kendall's Tau. In terms of Precision@K, it performs best in 10 out of 12 settings and ranks second in the remaining two. Two key observations are as follows.

Table 2. Kendall's Tau (scaled up by a factor of 100; mean and standard deviation over 9 trials) on three datasets: NAS-Bench-101, NAS-Bench-201, and NAS-Bench-301. In each setting, the best performance is highlighted in green. NA: there is no trivial extension of NAR-Former to NAS-Bench-301, which consists of two-cell architectures. Note that, in every setting, FLOWERFORMER performs best.

Encoder | NAS-Bench-101: 1% / 5% / 10% / 50% | NAS-Bench-201: 1% / 5% / 10% / 50% | NAS-Bench-301: 1% / 5% / 10% / 50% | Avg. Rank
GatedGCN [3] | 67.4 (6.0) / 79.6 (4.1) / 82.0 (5.1) / 84.8 (5.9) | 70.9 (1.8) / 84.1 (0.6) / 88.6 (0.3) / 92.3 (0.1) | 61.8 (2.4) / 70.0 (0.9) / 71.4 (1.0) / 72.7 (1.5) | 4.91
DAGNN [43] | 72.4 (4.5) / 82.9 (3.1) / 84.4 (4.4) / 85.9 (5.3) | 75.8 (1.0) / 87.5 (0.8) / 90.6 (0.2) / 92.6 (0.0) | 61.5 (1.9) / 70.9 (0.5) / 73.4 (1.2) / 76.1 (1.3) | 2.50
GraphGPS [37] | 70.6 (4.4) / 81.7 (3.8) / 83.9 (4.2) / 85.9 (5.1) | 71.3 (1.3) / 82.5 (0.6) / 87.8 (0.5) / 92.7 (0.1) | 59.7 (1.8) / 69.3 (0.9) / 70.7 (1.2) / 73.8 (0.7) | 4.75
DAGFormer [26] | 73.0 (4.3) / 75.6 (5.2) / 77.2 (7.0) / 80.9 (5.9) | 73.0 (73.0) / 84.9 (0.8) / 88.8 (0.5) / 92.7 (0.1) | 61.3 (2.0) / 70.7 (0.8) / 72.1 (0.8) / 74.8 (1.0) | 3.91
NAR-Former [50] | 59.4 (8.8) / 72.0 (8.2) / 75.5 (10.2) / 79.8 (5.9) | 62.3 (4.0) / 80.7 (1.8) / 87.3 (0.7) / 88.9 (0.3) | NA / NA / NA / NA | -
TA-GATES [32] | 70.8 (6.0) / 82.3 (2.7) / 83.9 (3.5) / 86.3 (3.9) | 77.7 (1.7) / 86.3 (0.8) / 88.7 (0.3) / 91.4 (0.5) | 61.3 (1.2) / 68.9 (1.6) / 71.8 (1.6) / 75.4 (0.7) | 3.83
FLOWERFORMER | 75.0 (2.9) / 86.1 (0.8) / 88.1 (0.2) / 89.6 (0.1) | 80.0 (0.8) / 89.8 (0.3) / 91.3 (0.2) / 92.9 (0.1) | 64.2 (1.6) / 72.2 (1.0) / 73.6 (1.3) / 77.5 (0.7) | 1.00

Table 3. Precision@K (scaled up by a factor of 100; mean and standard deviation over 9 trials). The proportion of training samples is fixed to 5%. In each setting, the best performance is highlighted in green. NA: there is no trivial extension of NAR-Former to NAS-Bench-301, which consists of two-cell architectures. Note that, in most cases, FLOWERFORMER identifies the top-K architectures most accurately.

Encoder | NAS-Bench-101 (5%): K = 1 / 5 / 10 / 50 | NAS-Bench-201 (5%): K = 1 / 5 / 10 / 50 | NAS-Bench-301 (5%): K = 1 / 5 / 10 / 50 | Avg. Rank
GatedGCN [3] | 44.4 (7.6) / 65.6 (3.5) / 76.2 (2.7) / 90.5 (2.1) | 42.3 (3.7) / 68.5 (3.1) / 80.9 (1.9) / 94.1 (0.6) | 19.1 (4.1) / 55.2 (4.5) / 71.8 (2.9) / 85.4 (0.4) | 4.83
DAGNN [43] | 41.7 (5.9) / 65.4 (4.2) / 79.3 (2.9) / 92.0 (1.3) | 49.6 (6.2) / 69.7 (3.0) / 83.1 (0.7) / 95.3 (0.9) | 23.1 (2.1) / 58.3 (3.4) / 73.1 (1.5) / 85.8 (0.4) | 2.75
GraphGPS [37] | 44.3 (12.2) / 67.1 (2.7) / 78.7 (1.9) / 91.2 (2.0) | 49.4 (4.6) / 67.9 (4.9) / 78.9 (3.4) / 93.4 (0.3) | 20.6 (2.1) / 57.2 (3.8) / 73.4 (2.5) / 84.8 (0.5) | 4.17
DAGFormer [26] | 39.4 (7.9) / 61.8 (5.6) / 71.6 (5.0) / 88.2 (2.4) | 50.7 (5.8) / 70.4 (2.9) / 82.5 (2.3) / 94.2 (0.5) | 20.7 (3.4) / 57.6 (3.7) / 73.4 (2.5) / 85.6 (0.4) | 3.83
NAR-Former [50] | 47.2 (9.9) / 62.6 (7.9) / 67.8 (8.4) / 85.9 (5.2) | 49.5 (6.5) / 64.7 (2.0) / 69.9 (2.0) / 92.3 (1.0) | NA / NA / NA / NA | -
TA-GATES [32] | 44.6 (9.7) / 66.6 (4.0) / 78.1 (4.6) / 91.8 (1.2) | 49.4 (3.1) / 66.7 (3.3) / 78.1 (2.8) / 94.8 (0.7) | 20.1 (5.0) / 56.2 (6.1) / 72.4 (3.6) / 84.7 (0.7) | 4.33
FLOWERFORMER | 46.5 (11.2) / 70.0 (1.5) / 80.9 (1.8) / 92.7 (1.7) | 57.0 (5.4) / 74.7 (1.8) / 85.2 (1.3) / 96.9 (0.7) | 20.8 (3.7) / 58.5 (2.5) / 74.7 (2.4) / 86.6 (0.6) | 1.08

The suboptimal performance of GraphGPS indicates that a graph transformer alone is insufficient to effectively represent neural architectures. Specifically, in terms of Kendall's Tau, the performance gap between FLOWERFORMER and GraphGPS can be as large as 8.7 percentage points. We thus argue that our incorporation of information flows into a graph transformer, through the introduction of the flow encode module and the flow-aware global attention module, plays a pivotal role in FLOWERFORMER's success.

The superiority of FLOWERFORMER over TA-GATES highlights the importance of the global attention mechanism. While TA-GATES may capture information flow at a local level through its information propagation scheme, it does not adequately leverage the global context of neural architectures. FLOWERFORMER, on the other hand, uses the global attention mechanism to capture graph-level (i.e., architecture-level) characteristics, empowering it to yield better representations of architectures.

In summary, our empirical findings substantiate that FLOWERFORMER serves as an effective predictor of neural architecture performance.

4.3. Ablation studies

In this subsection, we conduct ablation studies to validate the design choices made in FLOWERFORMER. Specifically, we aim to analyze the necessity of (a) asynchronous message passing, (b) forward-backward message passing, and (c) flow-aware global attention. To this end, we use four variants of FLOWERFORMER: (1) without the flow encode module (i.e., eliminating both asynchronous and forward-backward message passing), (2) without asynchronous message passing, (3) without forward-backward message passing, and (4) without flow-aware global attention.

Table 4. Comparison with four variants of FLOWERFORMER in terms of Kendall's Tau, using the same setups as in Tab. 2. In each setting, the best performances are highlighted in green. AS: asynchronous message passing. FB: forward-backward message passing. GA: global attention. In most cases, FLOWERFORMER, which is equipped with all components, outperforms all of its variants, thereby validating the effectiveness of each component.

Dataset | Variant | AS | FB | GA | 1% | 5% | 10% | 50%
NB-101 | (1) | ✗ | ✗ | ✔ | 41.5 (1.7) | 42.5 (1.6) | 41.1 (2.8) | 43.1 (1.4)
NB-101 | (2) | ✗ | ✔ | ✔ | 65.5 (8.8) | 56.8 (5.0) | 53.8 (7.4) | 69.6 (11.6)
NB-101 | (3) | ✔ | ✗ | ✔ | 76.7 (4.1) | 83.9 (2.6) | 84.6 (4.0) | 85.6 (5.4)
NB-101 | (4) | ✔ | ✔ | ✗ | 76.5 (3.0) | 83.2 (3.9) | 83.9 (5.1) | 85.3 (6.3)
NB-101 | FLOWERFORMER | ✔ | ✔ | ✔ | 75.0 (2.9) | 86.1 (0.8) | 88.1 (0.2) | 89.6 (0.1)
NB-201 | (1) | ✗ | ✗ | ✔ | 75.9 (1.2) | 86.5 (0.2) | 88.2 (0.3) | 89.7 (0.1)
NB-201 | (2) | ✗ | ✔ | ✔ | 73.7 (1.1) | 85.6 (0.7) | 89.2 (0.5) | 92.9 (0.1)
NB-201 | (3) | ✔ | ✗ | ✔ | 76.2 (2.1) | 88.6 (0.8) | 90.9 (0.1) | 92.9 (0.1)
NB-201 | (4) | ✔ | ✔ | ✗ | 79.3 (1.2) | 89.5 (0.5) | 91.1 (0.3) | 92.9 (0.3)
NB-201 | FLOWERFORMER | ✔ | ✔ | ✔ | 79.0 (0.8) | 89.8 (0.3) | 91.3 (0.2) | 92.9 (0.1)
NB-301 | (1) | ✗ | ✗ | ✔ | 63.3 (2.7) | 69.0 (2.8) | 68.1 (2.9) | 59.8 (2.3)
NB-301 | (3) | ✔ | ✗ | ✔ | 59.5 (2.6) | 69.0 (1.8) | 50.2 (15.0) | 45.3 (19.4)
NB-301 | (4) | ✔ | ✔ | ✗ | 60.9 (3.1) | 69.8 (1.4) | 70.9 (1.2) | 67.7 (3.1)
NB-301 | FLOWERFORMER | ✔ | ✔ | ✔ | 64.2 (1.6) | 72.2 (1.0) | 73.6 (1.3) | 77.5 (0.7)

As shown in Tab. 4, FLOWERFORMER, which is equipped with all the components, outperforms all of its variants in most settings, confirming the efficacy of our design choices. Further observations deserve attention. First, the necessity of asynchronous message passing for capturing flows is confirmed by the superior performance of (3) over (1), and that of FLOWERFORMER over (2).
Second, the advantage of forward-backward message passing is demonstrated by FLOWERFORMER's superiority over (3). Lastly, incorporating flow-awareness into global attention is advantageous, as evidenced by FLOWERFORMER's advantage over variant (4).

4.4. Performance in various domains

Since our input modeling does not require complex preprocessing, it can be readily applied to architectures across various domains. We apply FLOWERFORMER to graph neural networks on NAS-Bench-Graph and automatic speech recognition architectures on NAS-Bench-ASR. Among the baseline methods used in Sec. 4.2, we use the best method of each type: DAGNN, DAGFormer, and TA-GATES.

Table 5. Kendall's Tau (scaled up by a factor of 100; mean and standard deviation over 9 trials) on two datasets beyond the computer vision domain: NAS-Bench-Graph (NB-G) [36] and NAS-Bench-ASR (NB-ASR) [28]. In each setting, the best performance is highlighted in green. In most cases, FLOWERFORMER performs best.

Dataset | Encoder | 1% | 5% | 10% | 50%
NB-G | DAGNN [43] | 48.1 (3.2) | 64.4 (1.2) | 67.4 (1.1) | 73.1 (0.8)
NB-G | DAGFormer [26] | 47.9 (0.6) | 60.8 (1.6) | 64.9 (1.0) | 72.4 (0.3)
NB-G | TA-GATES [32] | 33.1 (1.4) | 34.1 (2.0) | 35.4 (0.8) | 35.7 (0.5)
NB-G | FLOWERFORMER | 49.5 (1.1) | 65.9 (1.3) | 68.9 (0.6) | 72.7 (0.2)
NB-ASR | DAGNN [43] | 29.5 (3.9) | 40.9 (2.4) | 45.2 (1.3) | 44.0 (0.4)
NB-ASR | DAGFormer [26] | 29.9 (5.4) | 42.5 (1.1) | 45.3 (1.0) | 34.6 (5.8)
NB-ASR | TA-GATES [32] | 34.0 (2.3) | 41.4 (2.0) | 44.9 (2.2) | 50.9 (0.8)
NB-ASR | FLOWERFORMER | 31.1 (8.0) | 44.0 (0.9) | 47.3 (1.3) | 52.2 (1.4)

As shown in Tab. 5, FLOWERFORMER performs best in most cases. These results indicate that FLOWERFORMER effectively captures important architectural characteristics across various domains. TA-GATES, which is tailored for encoding architectures in the computer vision domain, also shows strong performance in the domain of automatic speech recognition. TA-GATES effectively updates operation embeddings by multiplying operation embeddings and input information vectors, which is akin to the convolutional mechanisms prevalent in automatic speech recognition architectures. However, its effectiveness diminishes in scenarios where message passing between nodes, a key characteristic of graph neural networks, is required.

4.5. Training and inference speed

In this subsection, we compare the training and inference speeds of FLOWERFORMER and two state-of-the-art neural architecture encoding methods: NAR-Former and TA-GATES. To this end, we train all the models for 200 epochs with a batch size of 128, using an NVIDIA RTX 2080 GPU. We use NAS-Bench-101 with a training ratio of 1%. For a fair comparison, we exclude all additional time-consuming training strategies of NAR-Former and TA-GATES (e.g., input augmentation) in this experiment.

Table 6. Training and inference times on NAS-Bench-101 with a training ratio of 1%, 200 epochs, and a batch size of 128.

Encoder | Training time (sec) | Inference time (sec) | # Params
NAR-Former [50] | 278.13 (13.01) | 2.55 (0.07) | 4,882,081
TA-GATES [32] | 62.66 (0.54) | 3.01 (0.21) | 348,065
FLOWERFORMER | 58.08 (1.39) | 2.94 (0.07) | 901,459

As shown in Tab. 6, FLOWERFORMER takes the shortest training time among the three methods. In particular, training FLOWERFORMER is 4.44× faster than training NAR-Former. This substantial speed advantage stems from the notable difference in model sizes, with NAR-Former having 5.35× the number of parameters of FLOWERFORMER. Compared to TA-GATES, FLOWERFORMER exhibits a slight speed advantage. Despite the small model size of TA-GATES, our specialized batch operations boost the training of FLOWERFORMER. Refer to the supplementary material for details of the batch operations.

In terms of inference time, there is not much difference among the three methods, and FLOWERFORMER ranks second. In practical scenarios, neural architecture performance prediction involves collecting labels (e.g., ground-truth performance) for the architectures in the training set, which requires time-consuming training of the architectures. For example, in the case of NAS-Bench-101, training just 1% of the architectures can take up to 24 GPU hours. Thus, inference speed is not a bottleneck, given the extensive computational cost of training.

5. Conclusions

In this work, we propose FLOWERFORMER, a novel graph transformer model designed for neural architecture encoding. FLOWERFORMER excels at capturing information flows within neural architectures, considering both local and global aspects. Through comprehensive evaluations across five benchmarks for architecture performance prediction, FLOWERFORMER exhibits significant and consistent superiority over several state-of-the-art baseline methods, ultimately achieving state-of-the-art performance. Notably, FLOWERFORMER's superiority extends beyond computer-vision architectures, demonstrating its effectiveness for graph-learning and speech-recognition architectures.

Acknowledgements

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References

[1] Abien Fred Agarap. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375, 2018. 5
[2] Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical implications. arXiv preprint arXiv:2006.05205, 2020. 1
[3] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017. 6, 7, 11, 12
[4] Michail Chatzianastasis, George Dasoulas, Georgios Siolas, and Michalis Vazirgiannis. Graph-based neural architecture search with operation embeddings. In ICCV, 2021. 2
[5] Yaofo Chen, Yong Guo, Qi Chen, Minli Li, Wei Zeng, Yaowei Wang, and Mingkui Tan. Contrastive neural architecture search with neural architecture comparators. In CVPR, 2021. 1
[6] Ziye Chen, Yibing Zhan, Baosheng Yu, Mingming Gong, and Bo Du. Not all operations contribute equally: Hierarchical operation-adaptive predictor for neural architecture search. In ICCV, 2021. 1
[7] Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, and Yue Qi. Graph propagation transformer for graph representation learning. In IJCAI, 2023. 1, 3
[8] Hsin-Pai Cheng, Tunhou Zhang, Yixing Zhang, Shiyu Li, Feng Liang, Feng Yan, Meng Li, Vikas Chandra, Hai Li, and Yiran Chen. Nasgem: Neural architecture search via graph embedding method. In AAAI, 2021. 2
[9] Maxwell Crouse, Ibrahim Abdelaziz, Cristina Cornelio, Veronika Thost, Lingfei Wu, Kenneth Forbus, and Achille Fokoue. Improving graph neural network representations of logical formulae with subgraph pooling. arXiv preprint arXiv:1911.06904, 2019. 11
[10] Xuanyi Dong and Yi Yang. Nas-bench-201: Extending the scope of reproducible neural architecture search. In ICLR, 2020. 6, 11
[11] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699, 2020. 2
[12] C. Wei et al. Npenas: Neural predictor guided evolution for neural architecture search. IEEE Transactions on Neural Networks and Learning Systems, 2022. 12
[13] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n, 93:27403, 1993. 11
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 1
[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017. 1
[16] Md Shamim Hussain, Mohammed J Zaki, and Dharmashankar Subramanian. Global self-attention as a replacement for graph convolution. In KDD, 2022. 1, 2, 3
[17] Yinghui Jiang, Shuting Jin, Xurui Jin, Xianglu Xiao, Wenfan Wu, Xiangrong Liu, Qiang Zhang, Xiangxiang Zeng, Guang Yang, and Zhangming Niu. Pharmacophoric-constrained heterogeneous graph transformer model for molecular property prediction. Communications Chemistry, 6(1):60, 2023. 2
[18] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 2
[19] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. In NeurIPS, 2021. 2
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. 11
[21] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018. 2
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019. 11
[23] Shun Lu, Jixiang Li, Jianchao Tan, Sen Yang, and Ji Liu. Tnasp: A transformer-based nas predictor with a self-evolution framework. In NeurIPS, 2021. 2
[24] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In NeurIPS, 2018. 1, 2
[25] Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu. Semi-supervised neural architecture search. In NeurIPS, 2020. 1, 2
[26] Yuankai Luo, Veronika Thost, and Lei Shi. Transformers over directed acyclic graphs. In NeurIPS, 2023. 2, 5, 6, 7, 8, 11, 12
[27] Liheng Ma, Chen Lin, Derek Lim, Adriana Romero-Soriano, Puneet K Dokania, Mark Coates, Philip Torr, and Ser-Nam Lim. Graph inductive biases in transformers without message passing. In ICML, 2023. 1, 3
[28] Abhinav Mehrotra, Alberto Gil CP Ramos, Sourav Bhattacharya, Łukasz Dudziak, Ravichander Vipperla, Thomas Chau, Mohamed S Abdelfattah, Samin Ishtiaq, and Nicholas Donald Lane. Nas-bench-asr: Reproducible neural architecture search for speech recognition. In ICLR, 2020. 6, 8, 11
[29] Grégoire Mialon, Dexiong Chen, Margot Selosse, and Julien Mairal. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667, 2021. 2
[30] Hoang D Nguyen, Xuan-Son Vu, and Duc-Trong Le. Modular graph transformer networks for multi-label image classification. In AAAI, 2021. 2
[31] Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, and Huazhong Yang. A generic graph-based neural architecture encoding scheme for predictor-based nas. In ECCV, 2020. 2, 6, 11
[32] Xuefei Ning, Zixuan Zhou, Junbo Zhao, Tianchen Zhao, Yiping Deng, Changcheng Tang, Shuang Liang, Huazhong Yang, and Yu Wang. Ta-gates: An encoding scheme for neural network architectures. In NeurIPS, 2022. 2, 3, 6, 7, 8, 11, 12
[33] Peisong Niu, Tian Zhou, Qingsong Wen, Liang Sun, and Tao Yao. Chemistry guided molecular graph transformer. In NeurIPS 2022 Workshop: AI for Science: Progress and Promises, 2022. 1, 2
[34] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. arXiv preprint arXiv:1905.10947, 2019. 1
[35] Yunsheng Pang, Qiuhong Ke, Hossein Rahmani, James Bailey, and Jun Liu. Igformer: Interaction graph transformer for skeleton-based human interaction recognition. In ECCV, 2022. 2
[36] Yijian Qin, Ziwei Zhang, Xin Wang, Zeyang Zhang, and Wenwu Zhu. Nas-bench-graph: Benchmarking graph neural architecture search. In NeurIPS, 2022. 6, 8, 11
[37] Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. In NeurIPS, 2022. 1, 2, 6, 7, 11, 12
[38] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In NeurIPS, 2020. 2
[39] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–93, 2008. 11
[40] Pranab Kumar Sen. Estimates of the regression coefficient based on Kendall's tau. Journal of the American Statistical Association, 63(324):1379–1389, 1968. 6
[41] Han Shi, Renjie Pi, Hang Xu, Zhenguo Li, James Kwok, and Tong Zhang. Bridging the gap between sample-based and one-shot neural architecture search with bonas. In NeurIPS, 2020. 1
[42] Yu Shi, Shuxin Zheng, Guolin Ke, Yifei Shen, Jiacheng You, Jiyan He, Shengjie Luo, Chang Liu, Di He, and Tie-Yan Liu. Benchmarking graphormer on large-scale molecular modeling datasets. arXiv preprint arXiv:2203.04810, 2022. 2
[43] Veronika Thost and Jie Chen. Directed acyclic graph neural networks. In ICLR, 2021. 4, 6, 7, 8, 11, 12
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1
[45] Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian, and Rodrigo Fonseca. Alphax: exploring neural architectures with deep neural networks and monte carlo tree search. arXiv preprint arXiv:1903.11059, 2019. 2
[46] Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, and Pieter-Jan Kindermans. Neural predictor for neural architecture search. In ECCV, 2020. 1
[47] Colin White, Willie Neiswanger, and Yash Savani. Bananas: Bayesian optimization with neural architectures for neural architecture search. In AAAI, 2021. 1, 2
[48] Colin White, Arber Zela, Robin Ru, Yang Liu, and Frank Hutter. How powerful are performance predictors in neural architecture search? In NeurIPS, 2021. 1
[49] Shen Yan, Kaiqiang Song, Fei Liu, and Mi Zhang. Cate: Computation-aware neural architecture encoding with transformers. In ICML, 2021. 2, 5
[50] Yun Yi, Haokui Zhang, Wenze Hu, Nannan Wang, and Xiaoyu Wang. Nar-former: Neural architecture representation learning towards holistic attributes prediction. In CVPR, 2023. 2, 3, 6, 7, 8, 12
[51] Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards reproducible neural architecture search. In ICML, 2019. 2, 6, 11
[52] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In NeurIPS, 2021. 1, 2, 3
[53] Arber Zela, Julien Siems, and Frank Hutter. Nas-bench-1shot1: Benchmarking and dissecting one-shot neural architecture search. In ICLR, 2019. 6
[54] Arber Zela, Julien Siems, Lucas Zimmer, Jovita Lukasik, Margret Keuper, and Frank Hutter. Surrogate nas benchmarks: Going beyond the limited search spaces of tabular nas benchmarks. In ICLR, 2022. 6, 11
[55] Yi Zheng, Rushin H Gindra, Emily J Green, Eric J Burks, Margrit Betke, Jennifer E Beane, and Vijaya B Kolachalama. A graph-transformer for whole slide image classification. IEEE Transactions on Medical Imaging, 41(11):3003–3015, 2022. 2
A. Dataset description

In this section, we provide detailed descriptions of the datasets used.
• NAS-Bench-101 [51] is a dataset with 423K architectures trained on the CIFAR-10 [20] dataset. NAS-Bench-101 has an operation-on-node (OON) search space [31, 32]. Following Ning et al. [32], we used the same subset of the NAS-Bench-101 dataset, which consists of 14,580 architectures.
• NAS-Bench-201 [10] is a dataset with 15K architectures trained on the CIFAR-10 dataset. We transformed the dataset, which is originally operation-on-edge (OOE)-based, into the OON format.
• NAS-Bench-301 [54] is a surrogate benchmark with 57K architectures, each of which consists of two cells (spec., a normal cell and a reduction cell). This dataset is originally OOE-based, and we converted it into the OON format. Following Ning et al. [32], we only used the anchor architecture-performance pairs.
• NAS-Bench-ASR [28] is a dataset with 8K architectures of automatic speech recognition models, trained on the TIMIT audio dataset [13]. We transformed the dataset, which is originally OOE-based, into the OON format.
• NAS-Bench-Graph [36] is a dataset with 26K architectures of graph neural networks, trained on the Cora dataset [39]. Since the dataset is originally OON-based, no additional transformation is required.

B. Experimental Details

B.1. Implementation details

In this subsection, we provide several implementation details of FLOWERFORMER.

Code implementation: We employed the framework of GraphGPS [37] as the backbone to implement FLOWERFORMER with Python 3.10, PyTorch 1.13.1, and PyTorch Geometric 2.2.0.

Obtaining representations in two-cell datasets: To obtain representations in two-cell datasets (e.g., NAS-Bench-301), we used the following projection strategy. Let h_{1,o} ∈ R^d and h_{2,o} ∈ R^d denote the embeddings of the output nodes of cell 1 and cell 2 after forward message passing. We concatenated h_{1,o} and h_{2,o} and projected the concatenated embedding with a learnable projection matrix W^P ∈ R^{2d×2d}, as follows:

h' = \operatorname{concat}(h_{1,o}, h_{2,o})\, W^{P}.   (8)

Then, we split h' ∈ R^{2d} into two halves and regarded each half as the updated embedding of the corresponding output node:

h_{1,o} = (h'_1, h'_2, \cdots, h'_d),   (9)
h_{2,o} = (h'_{d+1}, h'_{d+2}, \cdots, h'_{2d}),   (10)

where h'_i is the i-th entry of h'. Finally, we started asynchronous backward message passing (step 3 in Algorithm 1 of the main paper) with the updated h_{1,o} and h_{2,o}.
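A minimal sketch of the projection in Eqs. (8)-(10) is shown below: concatenate the two output-node embeddings, project with a learnable 2d-by-2d matrix, and split the result back into two d-dimensional embeddings before backward message passing starts in each cell. The class name and shapes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class TwoCellProjection(nn.Module):
    """Eqs. (8)-(10): mix the output-node embeddings of the two cells."""
    def __init__(self, d: int):
        super().__init__()
        self.W_P = nn.Linear(2 * d, 2 * d, bias=False)   # learnable W^P in R^{2d x 2d}

    def forward(self, h1_o: torch.Tensor, h2_o: torch.Tensor):
        h_prime = self.W_P(torch.cat([h1_o, h2_o], dim=-1))   # Eq. (8)
        d = h1_o.shape[-1]
        return h_prime[..., :d], h_prime[..., d:]             # Eqs. (9)-(10): split into halves

# The two updated embeddings then serve as the output-node states when backward
# message passing (step 3 of Algorithm 1) begins in each cell.
h1_o, h2_o = torch.randn(64), torch.randn(64)
h1_new, h2_new = TwoCellProjection(d=64)(h1_o, h2_o)
```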
Training and hyperparameters: We used the AdamW optimizer [22] to train the models, and the best parameters were selected using early stopping. The hyperparameter search space was as follows:
• lr ∈ [10^{-4}, 10^{-2}]
• weight decay ∈ [10^{-10}, 10^{-3}]
• margin ∈ {0.01, 0.05, 0.1, 0.5, 1.0}
• L ∈ {4, 5, 6, 7, 8, 9, 10}
• d ∈ {64, 128, 256, 512}
• s ∈ {4, 8}
• dropout ∈ {0.1, 0.2, 0.3, 0.4, 0.5}
Further details regarding hyperparameters, including the best hyperparameter combination for each dataset, are available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer

B.2. Batch operation

Asynchronous message passing inevitably introduces some delay, since each operation should be performed in a sequential manner. In order to accelerate the computation, we employed group-based batch processing. Specifically, we used the topological batching strategy [9, 43], which is specialized in handling asynchronous operations. First, we grouped nodes that belong to the same topological generation and regarded each group as a single batch. Then, instead of updating the representation of a single node at a time, we updated the representations of all nodes in the same batch simultaneously. Note that this simultaneous update yields the same result as updating each node in a batch one by one, since the updates of nodes in the same topological generation are independent of each other. In this manner, for a one-way message passing, we performed |T^G| operations, which is generally smaller than the number of nodes.

B.3. Baseline methods

• GatedGCN [3] and GraphGPS [37]: We used the GatedGCN implementation provided by the GraphGPS repository. For GraphGPS, we used GatedGCN and Performer as the MPNN and attention modules, respectively. We followed the choice used for OGBG-CODE2, which is the only dataset modeled as a DAG in [37]. The GitHub repository is https://github.com/rampasek/GraphGPS
• DAGNN [43]: This model has a bidirectional option, and we treated whether or not to use it as a hyperparameter. The GitHub repository is https://github.com/vthost/DAGNN
• DAGFormer [26]: DAGFormer introduces a framework that is applicable to existing graph transformers. We used the DAG+GraphGPS setting, which uses depth positional encoding and replaces the attention module of GraphGPS with reachability attention. The GitHub repository is https://github.com/LUOyk1999/DAGformer
• NAR-Former [50]: We followed the augmentation technique and hyperparameter setting of NAR-Former used in [50] for each dataset. The GitHub repository is https://github.com/yuny220/NAR-Former
• TA-GATES [32]: We followed the hyperparameter setting of TA-GATES used in [32] for each dataset.

[Figure: TA-GATES, DAGNN, test error of NAR-Former, # test architectures]

C.2. Latency prediction experiments

To measure the encoding quality of FLOWERFORMER in various aspects and validate its effectiveness, we conduct a latency prediction experiment on NAS-Bench-201, comparing FLOWERFORMER with NAR-Former [50]. For this comparison, we utilize Mean Absolute Percentage Error (MAPE) and Error Bound Accuracy (Acc(δ)), the same [...]. As shown in Table 7, FLOWERFORMER outperforms NAR-Former in the latency prediction task.

Table 7. Mean Absolute Percentage Error (MAPE) and Error Bound Accuracy (ACC) at δ (scaled up by a factor of 100, mean over 9 trials) of latency prediction on the NAS-Bench-201 dataset. In each setting, the best performances are highlighted in green.

Encoder | MAPE ↓ (5% / 10%) | ACC (δ = 0.1%) ↑ (5% / 10%) | ACC (δ = 1%) ↑ (5% / 10%) | ACC (δ = 5%) ↑ (5% / 10%)
NAR-Former | 3.1 / 3.0 | 2.3 / 2.3 | 21.9 / 22.9 | 80.8 / 82.2
FLOWERFORMER | 1.1 / 0.9 | 8.6 / 12.7 | 67.2 / 78.3 | 97.4 / 97.0

While TA-GATES [...] cells, FLOWERFORMER lacks a dedicated global attention module that can capture their interactions. This limitation suggests that enhancing the global attention module to incorporate strategies like cross-attention could be a valuable future research direction.

Table 9. Kendall's Tau (scaled up by a factor of 100, mean over 9 trials) on the ENAS dataset. In each setting, the best performances are highlighted in green.

Encoder | 1% | 5% | 10% | 50% | Avg. Rank
GatedGCN [3] | 15.0 | 36.1 | 41.2 | 54.7 | 4.75
DAGNN [43] | 31.0 | 47.0 | 52.6 | 61.3 | 1.25
GraphGPS [37] | 6.9 | 26.5 | 34.2 | 51.2 | 6.00
DAGFormer [26] | 12.2 | 41.4 | 46.5 | 57.9 | 4.25
TA-GATES [32] | 22.9 | 45.2 | 49.4 | 61.2 | 2.50
FLOWERFORMER | 18.8 | 44.3 | 49.5 | 64.7 | 2.25
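The latency metrics reported in Tab. 7 (Sec. C.2) can be computed as in the sketch below. The exact formula for Acc(δ) is not spelled out in this excerpt; the sketch assumes the common convention that it is the fraction of predictions whose relative error is within δ, so treat this as an assumption rather than the paper's definition.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Percentage Error (in %)."""
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100)

def error_bound_accuracy(y_true: np.ndarray, y_pred: np.ndarray, delta: float) -> float:
    """Acc(delta): fraction of predictions whose relative error is at most delta (e.g., 0.01 for 1%)."""
    return float(np.mean(np.abs((y_pred - y_true) / y_true) <= delta))

# Illustrative latency values in milliseconds (not from the benchmark).
y_true = np.array([12.0, 8.5, 20.1, 15.3])
y_pred = np.array([12.1, 8.0, 20.0, 16.0])
print(mape(y_true, y_pred), error_bound_accuracy(y_true, y_pred, delta=0.05))
```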