
Let There Be Order: Rethinking Ordering in Autoregressive Graph Generation

Jie Bu, Kazi Sajeed Mehrab, Anuj Karpatne
Department of Computer Science, Virginia Tech, Blacksburg, VA 24060

arXiv:2305.15562v1 [cs.LG] 24 May 2023

Abstract
Conditional graph generation tasks involve training a model to generate a graph
given a set of input conditions. Many previous studies employ autoregressive mod-
els to incrementally generate graph components such as nodes and edges. However,
as graphs typically lack a natural ordering among their components, converting a
graph into a sequence of tokens is not straightforward. While prior works mostly
rely on conventional heuristics or graph traversal methods like breadth-first search
(BFS) or depth-first search (DFS) to convert graphs to sequences, the impact of
ordering on graph generation has largely been unexplored. This paper contributes
to this problem by: (1) highlighting the crucial role of ordering in autoregressive
graph generation models, (2) proposing a novel theoretical framework that per-
ceives ordering as a dimensionality reduction problem, thereby facilitating a deeper
understanding of the relationship between orderings and generated graph accuracy,
and (3) introducing "latent sort," a learning-based ordering scheme to perform
dimensionality reduction of graph tokens. Our experimental results showcase the
effectiveness of latent sort across a wide range of graph generation tasks, encourag-
ing future works to further explore and develop learning-based ordering schemes
for autoregressive graph generation.

1 Introduction
We consider the problem of generating graphs given conditional inputs, where there exists precisely
one target graph we are interested in generating for every conditional input. We call this problem
paired graph generation, since every input condition is paired with exactly one target graph. This
setting is highly relevant in many real-world applications such as predicting the road network of
a region given its satellite image [1], generating scene graphs with semantic relationships among
objects given an image of the objects [2, 3, 4], and inferring the underlying circuit graph of an
electrical instrument given its sensor observations [5]. This is different from the unpaired setting of
graph generation problems studied by several previous works, where we are interested in generating
distributions of graphs rather than a specific target graph.
The problem of graph generation has some similarity to natural language generation [6, 7, 8], since
they both involve generating outputs with variable sizes. However, a key distinction is that there is no

Preprint. Under review.


natural ordering of graph components (e.g., nodes and edges) as opposed to the sequential nature of
words in a sentence. Graph generation is thus a particularly challenging problem due to the inherent
diversity of graphs involving varying sizes and different topologies. Many previous approaches for
graph generation [9, 10, 11, 1, 12] have adopted autoregressive models such as recurrent neural
networks [13, 14] and auto-regressive transformers [15, 16, 17] to handle the variable sizes and
topologies in graphs. These models often convert graph components into vector representations or
tokens, and incrementally generate tokens representing graph components in an ordered sequence.
However, existing autoregressive models for graph generation face unique challenges in sorting
multi-dimensional graph tokens into ordered sequences. As graph components often have no natural
ordering [18], there can be n! possible orderings for n graph tokens. It has been shown by [19, 20, 11]
that finding an optimal ordering of unsorted elements in a set is key to the success of autoregressive
models, since certain orderings may be favored by the models leading to better performance. Despite
this, existing works on autoregressive graph generation [10, 11, 1, 12] make arbitrary choices for
ordering graph components based on graph traversals such as breadth-first-search (BFS) or depth-
first-search (DFS) without fully justifying their choices empirically or theoretically.
In this work, we present a novel perspective to address this issue by reframing the problem of ordering
a set of graph tokens as that of performing dimensionality reduction (DR). Specifically, by learning a
generalizable mapping of every possible graph token to a 1-D representation, we theoretically and
empirically show that a set of graph tokens can be conveniently ordered by sorting their 1-D values.
Our contributions can be summarized as follows:
• We develop novel theoretical frameworks to study the accuracy of DR techniques for sorting
unordered tokens in autoregressive models, opening a new line of research in developing
learning-based ordering schemes for paired graph generation.
• We propose the latent sort algorithm, a novel learning-based ordering scheme employing
DR for autoregressive graph generation. We provide theoretical bounds of the errors of
latent sort and highlight an interesting connection between performing latent sort and finding
approximate solutions to the shortest path problem.
• We propose a strong backbone network for autoregressive graph generation named Graph
Auto-Regressive Transformer (GraphART) based on a modified GPT-2 architecture [17].
GraphART is purely autoregressive in nature and does not use any graph-specific architec-
tural components such as graph neural networks (GNNs), in contrast to prior works.
• We empirically show that different token ordering schemes result in nontrivial performance
gaps in autoregressive models such as GraphART. We also show that Latent Sort is versatile
and achieves competitive performance over a wide range of graph generation tasks.

2 Related Works
A number of approaches have been developed for graph generation in a probabilistic setting where the
goal is to generate a distribution of graphs that resemble a target distribution given conditional inputs
[9, 10, 21, 22, 23, 11, 24, 25, 26]. These works fall in the “unpaired” graph generation category, since
there is no unique ground-truth target graph that is paired with every input condition. Hence, the
evaluation of performance of these works typically focuses on measuring the validity of the generated
graphs in terms of statistical similarity with the target distribution, rather than measuring deviation
from its corresponding ground-truth graph in a “paired” fashion, which is the focus of our work.
One line of work for paired graph generation involves using non-autoregressive approaches to
generate graphs [27, 28, 29, 3, 4]. These approaches generate entire graphs rather than individual
components, which are then matched to the target graphs using bipartite matching algorithms (e.g.,
Hungarian matching in DETR [30] for object detection). However, bipartite matching loss functions
can slow down training due to their high computational costs [31] and can potentially suffer from
slow convergence [32, 33]. Moreover, these models cannot directly handle output graphs with varying
sizes and have to resort to padding, making them inefficient for real-world applications.
Another line of work involves autoregressive models to generate variable-length sequences of graph
tokens. As a necessary preprocessing step, these methods require an approach to sequentialize target
graphs into an ordered sequence of tokens, where finding the ideal scheme for token ordering can
be non-trivial. Vinyals et al. [19] were the first to reveal the importance of ordering when using

sequence-to-sequence models to process sets. They proposed a training algorithm that searches for
the optimal order, but scalability issues would arise with large sets, and inexact searches or sampling
methods could negatively impact performance. More recently, Chen et al. [20] discuss the importance
of ordering for unpaired autoregressive graph generation, where they derive the joint probability
over the graph for sorting the nodes. To sequentialize graphs, most of the existing works favor
traditional graph traversal methods like BFS [10, 11] or DFS [34, 12, 11] . However, to the best
of our knowledge, no prior work has theoretically and empirically compared the performance of
autoregressive models for graph generation with varying token ordering schemes, which is one of the
contributions of our work.

3 Rethinking Sorting As a Dimensionality Reduction Problem


Given a set of M unordered points (or vector tokens) X ⊆ RN , |X | = M in an N -dimensional
space (N > 1), we are interested in finding the “optimal” ordering of points in X , denoted by
Y ∗ ∈ RM ×N , which when used as the target sequence to supervise an autoregressive model yields
optimal performance in generating X . Formally, let us denote the ordered sequence Y ∗ as the matrix
[x∗1 , x∗2 , ..., x∗M ]⊤, where every row x∗i of Y ∗ is a point in X and i denotes its sorted index in the

optimal ordering. For now, we assume that such an ordering exists for every set X . Later, in Section
3.3, we will describe some of the ideal properties of Y ∗ .
We are interested in finding the optimal ordering Y ∗ by sorting high-dimensional vector tokens in X .
This is challenging due to the absence of a well-defined comparison operation in the N -dimensional
space of vector tokens, resulting in a factorial number of possible orderings to be evaluated that is
computationally prohibitive. To tackle this issue, we consider sorting points in X as a dimensionality
reduction (DR) problem, as described in the following.

3.1 DR-based Sorting Algorithms

Figure 1: A general pipeline for sorting algorithms using dimensionality reduction (DR) mapping f. The target graph is tokenized into unordered tokens, which f maps to 1-D representations; argsorting these values yields the sorted indices used to order the tokens into the target sequence, which supervises the autoregressive graph generator (conditioned on the input conditions) through the loss function.

Figure 1 provides an overview of the generic DR-based sorting pipeline for any DR mapping f. The key insight that we leverage here is that while sorting points is ill-defined in N -dimensional space, it is well-defined in a 1-D space (see Remark E.1 in the Appendix). Hence, if we can learn a generalizable DR mapping from every possible token x to a corresponding 1D representation h, we can order tokens in any arbitrary set X by sorting their 1D representations. We can thus define DR-based sorting algorithms as follows.
Definition 3.1 (DR-based Sorting Algorithm). Given a sorting algorithm s : X → Y , let Y = [x1 , x2 , ..., xM ] be the resulting ordered sequence for the unordered set X ⊆ RN . We can then represent the sorting algorithm s by a DR mapping f : RN → R, such that the ordering in the original N -D space is given by sorting their 1-D values, hi = f (xi ) ∈ R, and ∀ i ≤ j ≤ M, f (xi ) ≤ f (xj ).

Note that the DR mapping f maps every token xi to a universal position hi in the latent space,
regardless of the combination of tokens in X that it appears with. Ordering X thus reduces to finding
a sorted path traversal in the 1D latent space. Thus, we formulate the problem of finding the optimal
sorting s∗ : X → Y ∗ as finding an optimal DR mapping f ∗ : RN → R, h∗i = f ∗ (x∗i ). While sorting
itself is challenging to perform in a differentiable way, finding differentiable approaches for DR is
relatively easier. This opens up new possibilities for developing learning-based methods to order
points in high-dimensional spaces. For simplicity, we assume the 1-D representations of all the points
in X , denoted by H, are normalized to [0, 1].
Latent Sort Algorithm: We propose latent sort as our DR-based sorting algorithm, where the DR
mapping f is represented by an auto-encoder model consisting of an MLP encoder fe and an MLP

decoder fd . We first train fe and fd to reconstruct all tokens in the dataset. We then freeze fe and plug it into the pipeline shown in Figure 1 as the DR mapping f for sorting. The properties of latent sort are analyzed in the following sections.
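For concreteness, a minimal sketch of latent sort is given below: an MLP autoencoder whose encoder maps each token to a single latent value, and an ordering step that argsorts those values. The hidden width, activation choices, and the commented training objective are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentSortAutoencoder(nn.Module):
    """MLP autoencoder: f_e maps a token in R^N to a 1-D latent h, f_d reconstructs the token."""
    def __init__(self, token_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(token_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )
        self.decoder = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, x):
        h = self.encoder(x)              # (M, 1) latent values
        return self.decoder(h), h

def latent_sort(tokens, model):
    """Order an unordered token set of shape (M, N) by argsorting the frozen encoder's 1-D outputs."""
    with torch.no_grad():
        h = model.encoder(tokens).squeeze(-1)   # (M,)
    order = torch.argsort(h)                    # ascending 1-D latent values
    return tokens[order]                        # target sequence for the autoregressive model

# Training of the DR mapping (reconstruction only; the LGP term of Section 3.3 can be added):
#   x_hat, _ = model(batch_tokens)
#   loss = F.mse_loss(x_hat, batch_tokens)
```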

3.2 Analyzing The Errors of DR-based Sorting Algorithms

To quantify the errors in a sorting algorithm s, we introduce a probability matrix P ∈ RM ×M where


each entry pij of P represents the probability of xi being x∗j . Thus, we have E[Y ] = P Y ∗ , and P
can be viewed as a soft permutation matrix that permutes Y ∗ into Y . We use the Frobenius norm of
the difference between E[Y ] and Y ∗ as the expected error in a sorting algorithm,
 
$$ E\big(\mathbb{E}[Y], Y^{*}\big) \;=\; \big\|\mathbb{E}[Y] - Y^{*}\big\|_{F}^{2} \;=\; \big\|P\,Y^{*} - Y^{*}\big\|_{F}^{2}, \qquad (1) $$

where ∥·∥F denotes the Frobenius norm. We can see that two factors contribute to the error term E: P
and Y ∗ . It is obvious how P affects the error, as P measures the deviation from the optimal ordering
Y ∗ , which reflects the property of an ordering method. We show that minimizing $E\big(\mathbb{E}[Y], Y^{*}\big)$ is equivalent to minimizing $\|P - I_M\|_F^2$ (see Lemma D.1 in the Appendix). On the other hand, it is less straightforward to see how the choice of Y ∗ affects the error, which we will discuss in Section
3.3. Here, we give two sources of errors that can lead to suboptimal P .
Errors from Ordering Ambiguity: The first source of errors is the ordering ambiguity. An ideal mapping f ∗ is one that does not result in any ambiguity in sorting, i.e., f ∗ (xi ) = f ∗ (xj ) iff i = j. However, such an ideal mapping only exists for certain conditions (see Remark E.3 in the Appendix); otherwise f is surjective (non-bijective) even if it has no reconstruction errors. Thus, it is possible that a DR mapping f assigns the same 1-D representation to different points in X , which can cause ambiguity in sorting these points. To describe this ambiguity, we introduce the concept of the ordering ambiguity set.

Figure 2: Illustration of the ordering ambiguity on 2D points within [0, 1], for Mean Squared Sort (left) and Summation Sort (right). Colored contour lines represent points with the same 1D latent value, i.e., they belong to the same ordering ambiguity set.

Definition 3.2 (Ordering Ambiguity Set). Let Ai be the ordering ambiguity set for a point xi ∈ X , where Ai = {xj | ∀xj ∈ X , f (xi ) = f (xj )}.

Ordering ambiguity occurs when there exists more than one element in the ordering ambiguity set, i.e., |Ai | > 1. Different DR mappings can have varying ordering ambiguity patterns. Figure 2 shows ordering ambiguity patterns in 2D space for two DR mappings: (i) Mean Squared Sort: $h_i = (x_{i1}^2 + x_{i2}^2)/2$, and (ii) Summation Sort: $h_i = x_{i1} + x_{i2}$.
To quantify the error introduced by ordering ambiguity, we consider the expectation of the ambiguity
set Ai for each point xi ∈ X . It is reasonable to assume that each row i of P is a uniform distribution
over the corresponding ambiguity set Ai (i.e., $\forall x_j^* \in A_i,\; p_{ij} = 1/|A_i|$ and $\forall x_j^* \notin A_i,\; p_{ij} = 0$).
Putting this in Equation (1), we get (see Section C.3 in Appendix for derivation):
 
$$ E\big(\mathbb{E}[Y], Y^{*}\big) \;=\; \sum_{i=1}^{M} \Big\| \frac{1}{|A_i|} \sum_{x_j \in A_i} x_j \;-\; x_i \Big\|^{2} \qquad (2) $$
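As a concrete illustration of Equation (2), the sketch below estimates the ambiguity-induced error of a DR mapping on a small 2-D point set by grouping points whose 1-D values coincide (within a tolerance). The grid construction, the tolerance, and the two example mappings are illustrative assumptions chosen so that ties actually occur.

```python
import numpy as np

def ambiguity_error(points, f, tol=1e-9):
    """Equation (2): sum_i || mean(A_i) - x_i ||^2, where A_i groups points with equal f values."""
    h = np.array([f(x) for x in points])
    error = 0.0
    for i, x_i in enumerate(points):
        A_i = points[np.abs(h - h[i]) <= tol]     # ordering ambiguity set of x_i
        error += np.sum((A_i.mean(axis=0) - x_i) ** 2)
    return error

# 2-D points on a regular grid, where ties are common for simple DR mappings.
grid = np.linspace(0.0, 1.0, 5)
X = np.array([[a, b] for a in grid for b in grid])

mean_squared = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)   # Mean Squared Sort
summation    = lambda x: x[0] + x[1]                     # Summation Sort

print(ambiguity_error(X, mean_squared), ambiguity_error(X, summation))
```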

Errors From Imperfect Reconstruction In Latent Sort: Since latent sort trains an autoencoder
using reconstruction loss for DR mapping, it is beneficial to discuss the potential errors from imperfect
reconstructions (See Section C.4 and C.5 in the Appendix for details). Imperfect reconstruction
leads to a discrepancy between the reconstructed token x̂ and the original token x. Assuming the reconstruction error follows a normal distribution, it induces a normally distributed error with bounded mean and variance in the latent space (see Theorem C.3 in the Appendix). This error can affect the ordering results when two points swap their positions in the sequence due to their 1-D latent values crossing each other. The probability of a swap can be quantified by examining the overlap between their 1-D latent distributions. Since the 1-D latent values follow Gaussian distributions, and the overlap of two normal distributions is very small when their means are far apart, we only consider swaps between neighboring points; non-neighboring points have larger differences in mean and are thus less likely to be swapped (see Section C.6 in the Appendix for details).

3.3 The Shortest Path Property of Desirable Ordering Schemes

In the preceding analysis, we discussed the error of sorting algorithms using the probability matrix P .
In this section, we discuss the desired properties of Y ∗ that minimize the errors induced by given P
matrices.
From Section C.7 in the Appendix, we have the simplified equation for error E as follows:

$$ E\big(\mathbb{E}[Y], Y^{*}\big) \;=\; \sum_{i=2}^{M-1} \big\| p_{i(i-1)}\,(x^{*}_{i-1} - x^{*}_{i}) + p_{i(i+1)}\,(x^{*}_{i+1} - x^{*}_{i}) \big\|_{F}^{2} \qquad (3) $$

Recall that f maps each point in X to its 1D latent representation independently of other points.
Therefore, it is unaware of the global information of the entire sequence. Consequently, we expect (3) to be minimized for any three points {x∗i−1 , x∗i , x∗i+1 } ⊂ X . To minimize $E\big(\mathbb{E}[Y], Y^{*}\big)$ for all possible X ⊆ RN , we need to minimize the upper bound of (3), given by:

$$ \sum_{i=2}^{M-1} \big\| p_{i(i-1)}\,(x^{*}_{i-1} - x^{*}_{i}) + p_{i(i+1)}\,(x^{*}_{i+1} - x^{*}_{i}) \big\|_{F}^{2} \qquad (4) $$
$$ \le\; \sum_{i=2}^{M-1} \big\| p_{i(i-1)}\,(x^{*}_{i-1} - x^{*}_{i}) \big\|_{F}^{2} \;+\; \sum_{i=2}^{M-1} \big\| p_{i(i+1)}\,(x^{*}_{i+1} - x^{*}_{i}) \big\|_{F}^{2}, \qquad (5) $$

assuming both pi(i−1) and pi(i+1) are nonzero for each i. In order to minimize errors from an imperfect P , the ideal sorting Y ∗ should minimize ∥x∗i−1 − x∗i ∥ and ∥x∗i+1 − x∗i ∥, i.e., minimize the distances between neighboring pairs. This property connects to the shortest path problem and the traveling salesman problem (TSP), with the target sorting being the TSP solution on the input point set X . However, TSP is NP-hard and solving it exactly is computationally expensive. Using TSP solutions also makes it difficult for autoregressive models to capture the ordering rule represented by the TSP solving algorithm. Thus, an approximation to the TSP solution provides a balanced option between ease of learning and error tolerance.
Theorems C.1 and C.2 in the Appendix state the correlation between the distance between two points and their 1-D latent distance. Since neighboring points in the sorted sequence have the closest latent values, the distances between neighboring points in the sorted sequence are expected to be small when the decoder is properly regularized. However, our experiments suggest that reconstruction loss and regularization techniques alone may not sufficiently approximate shortest path solutions. To further enhance the shortest path property in the latent sort algorithm, we introduce the latent gradient penalty (LGP) loss given by:

$$ \mathrm{LGP}(\mathcal{X}, f_e) \;=\; \min_{f_e} \sum_{i=2}^{M-1} \sum_{j=2}^{M-1} g_{ij}, \quad \text{where } g_{ij} = \begin{cases} \left( \dfrac{\|x_i - x_j\|}{|h_i - h_j| + \beta} - \alpha \right)^{2} & \text{if } |i - j| = 1, \\ 0 & \text{otherwise,} \end{cases} \qquad (6) $$

where fe is the latent sort encoder, and α and β are positive constants. The LGP aims to make the 1-D latent distance |hi − hj | between two neighboring points proportional to their distance ∥xi − xj ∥ in the original space, with the proportionality constant constrained by α (we chose α = 1). Unless otherwise specified, the latent sort algorithm in this paper is trained with LGP by default.
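A sketch of the LGP loss of Equation (6) over a sorted token sequence is shown below; only neighboring pairs contribute. The boundary handling (summing each neighboring pair once over the whole sequence), the value of β, and the weighting against the reconstruction loss are assumptions, while α = 1 follows the text.

```python
import torch

def latent_gradient_penalty(x_sorted, h_sorted, alpha=1.0, beta=1e-6):
    """LGP over a sorted sequence: only neighboring pairs (|i - j| = 1) contribute.

    x_sorted: (M, N) tokens in sorted order; h_sorted: (M,) their 1-D latent values.
    Penalizes deviation of ||x_i - x_j|| / (|h_i - h_j| + beta) from alpha.
    """
    dx = torch.norm(x_sorted[1:] - x_sorted[:-1], dim=-1)   # ||x_i - x_{i+1}||
    dh = torch.abs(h_sorted[1:] - h_sorted[:-1])            # |h_i - h_{i+1}|
    return ((dx / (dh + beta) - alpha) ** 2).sum()

# During latent sort training the encoder can be optimized with
#   reconstruction_loss + lgp_weight * latent_gradient_penalty(x_sorted, h_sorted)
# where lgp_weight is an assumed weighting hyperparameter.
```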

4 Experiment Setups

We briefly summarize the experiment setups in this section (see Appendix B for complete details of the processing pipelines for full reproducibility). All code is publicly available on GitHub: https://github.com/jayroxis/ordering-in-graph-generation.

Figure 3: Left: schematic views of the three paired graph generation tasks (from top to bottom): scene graph generation [3], topological graph extraction [1], and circuit graph prediction [5]. Right: the decoder block of the proposed GraphART model using a multi-head causal attention architecture.

4.1 Application Tasks and Datasets


We consider the following three application tasks for paired graph generation (see Figure 3a).
Scene Graph Generation: The goal of this task is to take an image as input and generate a scene
graph as output. We use the OpenPSG dataset [3], which consists of semantic-level triplets including
two objects (nodes) and predicates (edges) between them.
Topological Graph Extraction: The goal here is to predict the spatial location of nodes in a
graph and the undirected edges between them, given an image representation of the graph. We use
the Toulouse Road Network (TRN) dataset [1] and our synthetic Planar Graph dataset, which can
randomly generate image-graph pairs that are more challenging than the ones in the TRN dataset. We
average the results over 10 random runs for every method on the Planar Graph dataset.
Circuit Graph Prediction: For this task, we consider the Terahertz channelizer dataset [5], which
comprises 347k circuit graphs. The goal is to predict the ground-truth circuit graph comprising 3 to 6 resonators, given desired electromagnetic (EM) properties as inputs. Each resonator is represented
by a 9-dimensional vector. Since the scale of some dimensions in the resonator vectors is much
larger than others, which may dominate the prediction errors, we normalize the data using min-max
normalization per dimension. We report the errors on the normalized data.

4.2 Models and Pipelines

Additional details for the baseline models can be found in Section B in the Supplementary.
Backbone Models: We propose two backbone models for paired graph generation: GraphART and
GraphLSTM. These models rely solely on their autoregressive nature to sequentially predict graph
tokens, without involving graph-specific architecture components such as graph neural networks
(GNNs). Both models consist of an encoder to encode the input conditions (e.g., images) and an
autoregressive decoder for graph generation. The decoder of GraphART is based on our modified
version of the GPT-2 architecture, as shown in Figure 3b. It incorporates pre-normalization [35]
and LayerScale [36] designs. In GraphLSTM, we simply replace the autoregressive transformer of
GraphART with a long short-term memory (LSTM) model [13]. These models can be coupled with
any sorting algorithm for ordering graph tokens during training.
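For illustration, a minimal sketch of one GraphART decoder block is given below, combining pre-normalization [35], LayerScale [36], and multi-head causal self-attention. The layer widths, the LayerScale initialization value, and the absence of dropout are illustrative assumptions rather than the exact released configuration of the modified GPT-2 decoder.

```python
import torch
import torch.nn as nn

class GraphARTDecoderBlock(nn.Module):
    """Pre-LayerNorm transformer block with LayerScale and causal self-attention."""
    def __init__(self, dim, num_heads, mlp_ratio=4, ls_init=1e-4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ls1 = nn.Parameter(ls_init * torch.ones(dim))   # LayerScale for the attention branch
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.ls2 = nn.Parameter(ls_init * torch.ones(dim))   # LayerScale for the feed-forward branch

    def forward(self, x):
        # Causal mask so each token only attends to earlier tokens in the sequence.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + self.ls1 * attn_out
        x = x + self.ls2 * self.mlp(self.norm2(x))
        return x
```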
Baseline Sorting Algorithms: (1) Mean Squared Sort: n-dimensional tokens are sorted based on
the mean squared value of their n dimensions. This heuristic sorts the graph tokens based on their
overall magnitudes, from larger to smaller L2 norms. (2) Lexicographical Sort: We start by sorting
tokens based on their first dimension, then the second dimension if there are ties, and so on until
all dimensions have been considered. Note that this heuristic assumes a meaningful ordering of
dimensions in the data, which may not be true. (3) SVD Low-Rank Sort: We perform a Singular
Value Decomposition (SVD) of the tokens in a graph and sort the tokens by projecting them onto
the direction of maximum variance, identified using the largest principal component. This serves
as a linear baseline for dimensionality reduction, as opposed to the non-linear Latent Sort. (4) BFS
or DFS Sort: We use the breadth-first search (BFS) or depth-first search (DFS) traversal on edges,
implemented by NetworkX [37].
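The first three baseline heuristics can be sketched in a few lines, as below; this is a simplified interpretation of the descriptions above (e.g., descending order of magnitude for Mean Squared Sort, and projection onto the top right-singular vector for SVD Low-Rank Sort), not the exact released implementation.

```python
import numpy as np

def mean_squared_sort(tokens):
    """Sort by the mean squared value of each token's dimensions, from larger to smaller magnitude."""
    keys = (tokens ** 2).mean(axis=1)
    return tokens[np.argsort(-keys)]

def lexicographical_sort(tokens):
    """Sort by the first dimension, breaking ties by the second, and so on."""
    order = np.lexsort(tokens.T[::-1])   # np.lexsort treats the last key as primary, hence the reverse
    return tokens[order]

def svd_low_rank_sort(tokens):
    """Project the tokens of one graph onto the direction of maximum variance and sort the projections."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    keys = centered @ vt[0]              # scores along the largest principal component
    return tokens[np.argsort(keys)]
```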
Graph Tokenization Pipeline: To address the potential disadvantage of transformers attending to different token distributions for each category of graph components (e.g., nodes and edges), as advocated in [36], we implemented a novel edge-based tokenization approach. This approach is
similar to, but distinct from, the method used in Graph Transformers [38, 39]. Specifically, whenever
possible, we concatenated the features of each edge with graph-level features and the features of the
two connecting nodes to create edge-based tokens. For example, in the topological graph extraction
task where only node coordinates are available as features, we constructed edge tokens by combining
the coordinates of the two connecting nodes for each edge. In the scene graph generation task,
we followed the convention of using one-hot encoded triplets as edge tokens. In the circuit graph
prediction task, we utilized a 9-dimensional vector as a token to represent each node (resonator), as
employed in [5].
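A sketch of the edge-based tokenization used for the topological graph extraction task is shown below, where each edge token concatenates the coordinates of its two endpoints (and, when available, graph-level features). The function signature and the optional graph-feature argument are assumptions for illustration.

```python
import numpy as np

def edge_tokens(node_coords, edges, graph_feat=None):
    """Build one token per edge by concatenating the features of its two endpoints
    (and optional graph-level features), as in the topological graph extraction task."""
    tokens = []
    for u, v in edges:
        parts = [node_coords[u], node_coords[v]]
        if graph_feat is not None:
            parts.append(graph_feat)
        tokens.append(np.concatenate(parts))
    return np.stack(tokens)          # shape: (num_edges, token_dim)

# Example: a triangle with 2-D node coordinates -> three 4-D edge tokens.
coords = np.array([[0.1, 0.2], [0.8, 0.3], [0.5, 0.9]])
print(edge_tokens(coords, [(0, 1), (1, 2), (0, 2)]).shape)   # (3, 4)
```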
Loss Functions: We trained all autoregressive models using a teacher forcing setup. For directed
graphs, we calculated the loss between each element in the ground-truth sequence Ygt and its
corresponding element in the predicted sequence Ypred using an appropriate choice of loss function
L(Ypred , Ygt ). For circuit extraction, we employed the elastic loss function, which is the sum of L1
and MSE loss, as L. In scene graph generation, we used the binary cross-entropy (BCE) loss instead
of cross-entropy (CE) loss, following previous works [40, 41]. For the topological graph extraction
task involving undirected graphs, we utilized the elastic loss as L, but we considered its undirected
variant Lu . This variant accounts for swapping the node features in every edge token. The undirected
graph loss is defined as follows: $L_u(Y_{pred}, Y_{gt}) = L(Y_{pred}, Y_{gt}) + L(Y_{pred}, \bar{Y}_{gt})$, where $\bar{Y}_{gt}$ is the sequence in which the two node features in every edge token are swapped.
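The elastic loss and its undirected variant L_u can be sketched as follows, assuming each edge token is the concatenation of two equally sized node-feature halves that are swapped to form the swapped target; the reduction and any loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def elastic_loss(pred, target):
    """Elastic loss L: sum of L1 and MSE losses."""
    return F.l1_loss(pred, target) + F.mse_loss(pred, target)

def undirected_elastic_loss(pred, target):
    """L_u(Y_pred, Y_gt) = L(Y_pred, Y_gt) + L(Y_pred, Y_gt with node halves swapped).

    Assumes each edge token is [node_a_features, node_b_features] with halves of equal size.
    """
    d = target.size(-1) // 2
    target_swapped = torch.cat([target[..., d:], target[..., :d]], dim=-1)
    return elastic_loss(pred, target) + elastic_loss(pred, target_swapped)
```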
Evaluation Metrics: During testing, we employed the Edge Mover distance (EMD), Edge Hausdorff
distance (EHD), and StreetMover Distance (SMD) from [1] as evaluation metrics to assess the
similarity between predicted graphs and ground-truth graphs. The EMD calculates the average of the
minimum distances between each point in the sequence Ypred and its nearest neighbor in the set Ygt .
The EHD measures the maximum distance between the closest points of the two sequences. On the
other hand, the SMD is based on point cloud distances. Further details on these metrics can be found
in Supplementary B. For scene graph generation, we consider precision, recall, and F-1 scores as
the predicted sequences consist of multi-class labels. Additionally, we report the differences in sizes
between the predicted graphs and the target graphs as another metric in each application.
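Based on the descriptions above, the EMD and EHD can be sketched as follows, treating the predicted and ground-truth sequences as point sets; the symmetric form of the Hausdorff distance and the plain Euclidean metric are assumptions, and the SMD from [1] is not reproduced here.

```python
import numpy as np

def _min_dists(a, b):
    """For every point in a, the distance to its nearest neighbor in b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=-1).min(axis=1)

def edge_mover_distance(pred, gt):
    """EMD: average nearest-neighbor distance from predicted points to the ground truth."""
    return float(_min_dists(pred, gt).mean())

def edge_hausdorff_distance(pred, gt):
    """EHD: symmetric Hausdorff distance, the largest nearest-neighbor distance in either direction."""
    return float(max(_min_dists(pred, gt).max(), _min_dists(gt, pred).max()))
```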

5 Experimental Results

5.1 Comparing Performance on Paired Graph Generation Tasks

Table 1: Toulouse Road Network Generation.

Model Sorting / Matching SMD [1]
GraphART Latent Sort 0.0075
GraphART Mean Squared Sort 0.0081
GraphART Lexicographical Sort 0.0422
GraphART SVD Low-Rank Sort 0.0751
GraphART BFS Sort 0.0893
GraphART DFS Sort 0.0761
GGT [1] BFS Sort 0.0158
GraphRNN [1, 10] BFS Sort 0.0245

Table 2: Planar Graph Generation.

Model Sorting / Matching EMD EHD SMD Size. Diff.
GraphART Latent Sort 0.0038 0.0166 0.0002 0.0021
GraphART Mean Squared Sort 0.0031 0.0151 0.0002 -0.0360
GraphART Lexicographical Sort 0.0141 0.0564 0.0013 -0.2434
GraphART SVD Low-Rank Sort 0.0381 0.0951 0.0068 4.8280
GraphART BFS Sort 0.0217 0.0579 0.0017 1.7296
GraphART DFS Sort 0.0213 0.0572 0.0017 1.6399
GraphLSTM Latent Sort 0.0100 0.0293 0.0006 0.1639
GraphLSTM Mean Squared Sort 0.0063 0.0211 0.0004 0.0060
GraphLSTM Lexicographical Sort 0.0248 0.0609 0.0030 3.9318
GraphLSTM SVD Low-Rank Sort 0.0480 0.0940 0.0069 4.5328
GraphLSTM BFS Sort 0.0446 0.0888 0.0059 3.6330
GraphLSTM DFS Sort 0.0391 0.0738 0.0045 1.7268
GraphTR Hungarian Matcher 0.0069 0.0348 0.0003 -0.0130

Tables 1 and 2 compare the performance of the Latent Sort algorithm with baseline sorting algorithms and models on the road network generation and planar graph generation tasks, respectively. The top-2 models are highlighted in bold in all tables in this paper. We observe that GraphART coupled with Latent Sort consistently demonstrates superior performance for topological graph extraction compared to baseline methods such as GGT [1], GraphRNN [1, 10], and GraphTR. This is noteworthy as GGT and GraphRNN utilize graph-specific architectures to predict adjacency matrices in addition to graph generation, unlike our approach. These

results showcase the effectiveness of using purely autoregressive models for graph generation when
combined with an appropriate graph token ordering scheme like Latent Sort.
Among the various sorting algorithms used with GraphART, Mean Squared Sort stands out for its
impressive performance on both datasets, despite being a simple heuristic. This could potentially be
attributed to its simplicity in sorting based on token magnitudes, which the autoregressive models can
effectively capture. It highlights the importance of ordering schemes being "easy to learn" which is
commonly neglected. On the other hand, BFS and DFS sort consistently exhibit inferior performance
on both datasets for paired graph generation, despite being the de facto convention used in existing
methods for probabilistic graph generation. We hypothesize that although BFS and DFS can generate
viable distributions of graphs (e.g., molecules) in a probabilistic setting, they perform poorly when an
exact match with a target graph is required. This is because they do not consider the values of graph
tokens, but only their relative positions in the graph, which can be arbitrarily defined. Moreover, the
autoregressive model needs to understand the global structure of the graph to determine the sequence
ordering, adding significant complexity in learning the ordering. Additionally, we observe that
GraphLSTM generally does not perform as well as GraphART. This could be due to the advantage of
transformers, which can attend to longer contexts and thus gain a better understanding of global information in graphs compared to LSTMs.

Table 3: Semantic Scene Graph Generation on the OpenPSG [3] Dataset.


Model Sorting / Matching Additional Labels Precision Recall F1-Score
GraphART Latent Sort None 0.2537 0.1724 0.2053
GraphART Lexicographical Sort None 0.2425 0.1867 0.2110
GraphART SVD Low-Rank Sort None 0.2496 0.1696 0.2020
GraphLSTM Latent Sort None 0.0115 0.0094 0.0104
GraphLSTM Lexicographical Sort None 0.0110 0.0093 0.0101
GraphLSTM SVD Low-Rank Sort None 0.0099 0.0079 0.0088
PSGTR [3] Hungarian Matcher Panoptic Segmentation 0.1810 0.2116 0.1920
PSGFormer [3] Query Matching Block Panoptic Segmentation 0.0438 0.0520 0.0467

Tables 3 and 4 present the results for the scene graph generation and circuit graph prediction tasks,
respectively. Both tasks involve directed graphs. It is important to note that the mean squared sort
cannot be applied to scene graph generation due to the one-hot encoding of graph tokens, which leads
to identical mean squared values for all tokens. Additionally, the BFS and DFS sorting algorithms
could not be readily applied to either dataset, as both mostly consist of small disconnected graphs. In the
scene graph generation task, we observe that GraphART exhibits higher precision compared to the
state-of-the-art (SOTA) baselines PSGTR and PSGFormer for all sorting algorithm choices, where
PSGTR [3] and PSGFormer [3] use additional panoptic segmentation labels for training while we
only rely on the scene graph labels. However, it has lower recall than PSGTR. It is worth mentioning
that PSGTR has an advantage over our method as it utilizes additional supervision from panoptic
segmentation during training and knows the ground-truth sizes of target graphs as user-specified

Table 4: Results on Circuit Graph Prediction [5].
Model Token Sorting / Matching EMD EHD Size. Diff.
GraphART Latent Sort 0.0509 0.0856 0.6320
GraphART Mean Squared Sort 0.0558 0.0860 0.0794
GraphART Lexicographical Sort 0.0623 0.1080 1.4757
GraphART SVD Low-Rank Sort 0.0590 0.0863 1.7485
GraphLSTM Latent Sort 0.0496 0.0757 0.4249
GraphLSTM Mean Squared Sort 0.0576 0.0836 -0.1305
GraphLSTM Lexicographical Sort 0.0528 0.0803 -0.1205
GraphLSTM SVD Low-Rank Sort 0.0516 0.0777 0.6112
Circuit-GNN [5] N/A 0.0738 0.0943 (Set to GT)

hyperparameters during testing. Despite this, GraphART shows better F-1 scores than PSGTR for multiple sorting algorithms. Moreover, we observe that lexicographical sort performs the best on this dataset, followed closely by Latent Sort. Using lexicographical sort as a domain-specific heuristic for this problem aligns well with the categorical nature of graph tokens. We include additional results in Section A in the Supplementary, including a discussion of the label noise in the dataset that leads to size differences between the GraphART-generated scene graphs and the ground-truth labels. For the circuit
graph prediction task, both GraphART and GraphLSTM models outperform Circuit-GNN. Among
the ordering schemes, Latent Sort exhibits a slight advantage over others.
In summary, while specific heuristics may excel in certain applications (e.g., mean squared sort
in topological graph generation and lexicographical sort in scene graph generation), Latent Sort
demonstrates versatility and consistently delivers competitive performance across all four datasets.

5.2 Analyzing Sorting Ambiguity and Shortest Path Property of Latent Sort

To analyze the importance of using the latent gradient penalty (LGP) in the training of 1-D representations by latent sort, Figure 4 shows the contour lines in the 1-D space learned by latent sort for a toy 2-D dataset, with and without LGP. The latent space learned without LGP exhibits large regions with identical values, which, according to the theory of ordering ambiguity, can lead to significant errors. In contrast, the latent space learned with LGP demonstrates finer partitioning, resulting in smaller regions sharing the same values. This leads to reduced ambiguity in ordering and, consequently, lower errors due to ordering ambiguity.

Figure 4: 1-D representation of 2-D points learned by latent sort with (right) and without (left) using latent gradient penalty (LGP). White lines are contours showing regions with the same latent values.

Figure 5: Approximating shortest path solutions using latent sort (with LGP), shown for point sets of size N = 25, 50, 75, and 100.

Moreover, to empirically support the theoretical claim that latent sort finds an approximate solution
to the shortest path problem, Figure 5 showcases the paths identified by latent sort over different

collections of points in 2-D space. These paths are determined by sorting the points based on their 1-D
values. The examples demonstrate that latent sort reasonably approximates the shortest path solution,
even for a large number of points. For additional quantitative results, please refer to Appendix A.
These findings highlight an intriguing connection between Latent Sort and shortest path problems,
motivating a new use-case for Latent Sort as a fast, low-cost method for approximating shortest path
solutions through GPU-based parallelization of neural networks. This is particularly beneficial for
applications with strict runtime efficiency requirements.

6 Conclusions
We presented a novel autoregressive framework for paired graph generation by reframing the task of
sorting unordered tokens as a dimensionality reduction problem from a theoretical standpoint. We
showed the efficacy of our proposed latent sort algorithm on various graph generation tasks. However,
we also acknowledge certain limitations of our work. While latent sort performs competitively across
a wide range of datasets, it does not consistently outperform application-specific heuristics. This
suggests the need for future works to explore more sophisticated learning-based ordering schemes.
Additionally, our experiments are currently limited to small graphs, emphasizing the importance of
investigating the scalability of our approach to larger graphs in future research.

References
[1] Davide Belli and Thomas Kipf. Image-conditioned graph generation for road network extraction.
arXiv preprint arXiv:1910.14388, 2019.
[2] Yichao Lu, Himanshu Rai, Jason Chang, Boris Knyazev, Guangwei Yu, Shashank Shekhar,
Graham W Taylor, and Maksims Volkovs. Context-aware scene graph generation with seq2seq
transformers. In Proceedings of the IEEE/CVF international conference on computer vision,
pages 15931–15941, 2021.
[3] Jingkang Yang, Yi Zhe Ang, Zujin Guo, Kaiyang Zhou, Wayne Zhang, and Ziwei Liu. Panoptic
scene graph generation. In ECCV, 2022.
[4] Rongjie Li, Songyang Zhang, and Xuming He. Sgtr: End-to-end scene graph generation with
transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 19486–19496, 2022.
[5] Guo Zhang, Hao He, and Dina Katabi. Circuit-gnn: Graph neural networks for distributed
circuit design. In International Conference on Machine Learning, pages 7364–7373, 2019.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[7] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin,
Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to
follow instructions with human feedback. Advances in Neural Information Processing Systems,
35:27730–27744, 2022.
[8] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece
Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general
intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
[9] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep
generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
[10] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generat-
ing realistic graphs with deep auto-regressive models. In International conference on machine
learning, pages 5708–5717. PMLR, 2018.
[11] Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Will Hamilton, David K Duvenaud, Raquel
Urtasun, and Richard Zemel. Efficient graph generation with graph recurrent attention networks.
Advances in neural information processing systems, 32, 2019.
[12] Chia-Cheng Liu, Harris Chan, Kevin Luk, and AI Borealis. Auto-regressive graph generation
modeling with improved evaluation methods. In 33rd Conference on Neural Information
Processing Systems. Vancouver, Canada, 2019.

[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[14] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-
decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[16] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language
understanding by generative pre-training. OpenAI blog, 2018.
[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[18] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large
graphs. Advances in neural information processing systems, 30, 2017.
[19] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to sequence for
sets. arXiv preprint arXiv:1511.06391, 2015.
[20] Xiaohui Chen, Xu Han, Jiajing Hu, Francisco JR Ruiz, and Liping Liu. Order matters: Prob-
abilistic modeling of node sequence for graph generation. arXiv preprint arXiv:2106.06189,
2021.
[21] Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. Netgan:
Generating graphs via random walks. In International conference on machine learning, pages
610–619. PMLR, 2018.
[22] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph
variational autoencoders for molecule design. Advances in neural information processing
systems, 31, 2018.
[23] Tengfei Ma, Jie Chen, and Cao Xiao. Constrained generation of semantically valid graphs via
regularizing variational autoencoders. Advances in Neural Information Processing Systems, 31,
2018.
[24] Carl Yang, Peiye Zhuang, Wenhan Shi, Alan Luu, and Pan Li. Conditional structure generation
through graph variational generative adversarial nets. Advances in neural information processing
systems, 32, 2019.
[25] Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. Score-based generative modeling of graphs
via the system of stochastic differential equations. In International Conference on Machine
Learning, pages 10362–10383. PMLR, 2022.
[26] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal
Frossard. Digress: Discrete denoising diffusion for graph generation. International Conference
on Learning Representations (ICLR 2023), 2022.
[27] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov,
and Alexander J Smola. Deep sets. Advances in neural information processing systems, 30,
2017.
[28] Yan Zhang, Jonathon Hare, and Adam Prugel-Bennett. Deep set prediction networks. Advances
in Neural Information Processing Systems, 32, 2019.
[29] Adam R Kosiorek, Hyunjik Kim, and Danilo J Rezende. Conditional set generation with
transformers. arXiv preprint arXiv:2006.16841, 2020.
[30] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and
Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision–ECCV
2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16,
pages 213–229. Springer, 2020.
[31] Roy Jonker and Ton Volgenant. A shortest augmenting path algorithm for dense and sparse
linear assignment problems. In DGOR/NSOR: Papers of the 16th Annual Meeting of DGOR in
Cooperation with NSOR/Vorträge der 16. Jahrestagung der DGOR zusammen mit der NSOR,
pages 622–622. Springer, 1988.

[32] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris M Kitani. Rethinking transformer-based
set prediction for object detection. In Proceedings of the IEEE/CVF international conference
on computer vision, pages 3611–3620, 2021.
[33] Gongjie Zhang, Zhipeng Luo, Yingchen Yu, Kaiwen Cui, and Shijian Lu. Accelerating detr
convergence via semantic-aligned matching. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 949–958, 2022.
[34] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for
molecular graph generation. In International conference on machine learning, pages 2323–2332.
PMLR, 2018.
[35] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang,
Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture.
In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
[36] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Go-
ing deeper with image transformers. In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 32–42, 2021.
[37] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics,
and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors,
Proceedings of the 7th Python in Science Conference, pages 11 – 15, Pasadena, CA USA, 2008.
[38] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen,
and Tie-Yan Liu. Do transformers really perform badly for graph representation? Advances in
Neural Information Processing Systems, 34:28877–28888, 2021.
[39] Jinwoo Kim, Dat Nguyen, Seonwoo Min, Sungjun Cho, Moontae Lee, Honglak Lee, and Se-
unghoon Hong. Pure transformers are powerful graph learners. Advances in Neural Information
Processing Systems, 35:14582–14595, 2022.
[40] Lucas Beyer, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord.
Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020.
[41] Ross Wightman, Hugo Touvron, and Hervé Jégou. Resnet strikes back: An improved training
procedure in timm. arXiv preprint arXiv:2110.00476, 2021.
[42] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[43] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural
networks, 2020.
[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition, 2015.
[45] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng
Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei.
ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision
(IJCV), 115(3):211–252, 2015.
[46] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image
recognition at scale, 2021.
[47] Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural
networks, 2019.

A Additional Experiment Results
In this section, we present additional experiment results from our study, offering further insights into
the aspects and factors contributing to our proposed framework.

A.1 Approximating TSP Solutions Using Latent Sort

To investigate the performance of latent sort algorithms in approximating the Traveling Salesman
Problem (TSP), we present additional results using a synthetic 2-D dataset in R2 . All points in the
dataset are generated using a uniform distribution, and the objective is to find the shortest path that
traverses all the given points. Table 5 provides a summary of the experiment outcomes.

Table 5: Approximating TSP solutions using latent sort algorithm. We report the percentage of all
possible paths traversing all points (generated using brute-force enumeration) that are longer than
the solution predicted by the latent sort algorithm. The results are reported in “mean ± standard
deviation” from 10 random runs of generating N random points and running latent sort.
N=5 N=6 N=7 N=8
LS w. LGP 93.17% ± 11.77% 92.00% ± 8.69% 97.01% ± 6.10% 99.59% ± 0.34%
LS w.o. LGP 82.17% ± 13.12% 90.44% ± 11.44% 94.67% ± 6.64% 96.61% ± 5.14%

Figure 6: Example solutions of TSP problems (N = 8) with randomly initialized points, comparing approximate solutions from the latent sort algorithm with exact solutions from brute-force search.

The results consistently demonstrate that employing latent sort with the latent gradient penalty (LGP)
achieves higher approximation rates compared to using latent sort without LGP. This finding supports
the conclusions discussed in Section 5.2 regarding the effectiveness of LGP in the latent sort algorithm
for approximating TSP solutions. Note that the methods in general appear to perform better for a
higher number of points, as there are more possible paths in total, leading to a higher percentage of
paths being longer than the one predicted by latent sort.
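The percentage reported in Table 5 can be estimated with the brute-force comparison sketched below; the assumptions are that paths are open (no return to the start) and that a trained latent sort encoder supplies the 1-D values (a stand-in mapping is used in the example).

```python
import itertools
import numpy as np

def path_length(points, order):
    """Total length of the open path visiting the points in the given order."""
    p = points[list(order)]
    return float(np.linalg.norm(p[1:] - p[:-1], axis=-1).sum())

def percent_longer_than_latent_sort(points, latent_values):
    """Percentage of all M! orderings whose path is longer than the latent sort path."""
    ls_length = path_length(points, np.argsort(latent_values))
    all_orders = itertools.permutations(range(len(points)))
    lengths = np.array([path_length(points, order) for order in all_orders])
    return float((lengths > ls_length).mean() * 100.0)

# Example with random points and a stand-in 1-D mapping (a trained encoder would be used instead).
pts = np.random.rand(8, 2)
print(percent_longer_than_latent_sort(pts, pts.sum(axis=1)))
```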

A.2 Semantic Scene Graph Generation on OpenPSG

Here we provide some examples of the results (see Figure 7) for the semantic scene graph generation
task on the OpenPSG dataset.
We can see that GraphART generally tends to predict slightly smaller scene graphs compared to
the ground-truth labels. This model behavior can be attributed to the presence of label noise in

(a)
Ground-Truth Scene Graph: person attached to person; person attached to teddy bear; person playing with teddy bear; teddy bear attached to person.
GraphART (Latent Sort) Predictions: person attached to person; person holding teddy bear; person sitting on wall-other-merged.

(b)
Ground-Truth Scene Graph: person carrying handbag; person on road; person crossing road; person walking on pavement-merged; person looking at pavement-merged; traffic light over road.
GraphART (Latent Sort) Predictions: person beside person; person walking on pavement-merged; person standing on pavement-merged; car parked on pavement-merged; sky-other-merged over building-other-merged; pavement-merged attached to pavement-merged.

(c)
Ground-Truth Scene Graph: person standing on pavement-merged; car on road; car parked on road; car driving on road; traffic light over road; sky-other-merged over building-other-merged.
GraphART (Latent Sort) Predictions: car driving on road; sky-other-merged over tree-merged; sky-other-merged beside pavement-merged.

(d)
Ground-Truth Scene Graph: person talking to person; person holding backpack; person talking to cell phone; person walking on pavement-merged; person standing on pavement-merged; person looking at pavement-merged; bicycle parked on pavement-merged; traffic light beside traffic light; potted plant attached to building-other-merged.
GraphART (Latent Sort) Predictions: person attached to pavement-merged; person walking on pavement-merged; traffic light beside traffic light; traffic light attached to pavement-merged; window-other in building-other-merged.

(e)
Ground-Truth Scene Graph: chair beside cabinet-merged; couch enclosing table-merged; book hanging from wall-other-merged; door-stuff on wall-other-merged; light on couch; table-merged on rug-merged; wall-other-merged on curtain.
GraphART (Latent Sort) Predictions: chair on floor-wood; couch beside couch; couch on floor-wood; book on shelf.

(f)
Ground-Truth Scene Graph: person beside person; person feeding giraffe; giraffe on dirt-merged; pavement-merged beside giraffe; giraffe beside giraffe; giraffe over tree-merged; giraffe beside grass-merged; sky-other-merged over giraffe; sky-other-merged over tree-merged; sky-other-merged over grass-merged.
GraphART (Latent Sort) Predictions: person beside person; person over grass-merged; giraffe standing on person; giraffe over giraffe.

Figure 7: Randomly sampled results for semantic scene graph generation on the OpenPSG dataset. Each example consists of the input image (left), ground-truth labels (middle), and predictions generated by GraphART using the latent sort algorithm (right). The samples show the label noise in the dataset.

the OpenPSG dataset, although the dataset is already known to have less noise compared to other
scene graph datasets [3]. Due to this issue in the training data, GraphART becomes cautious in
predicting triplets with low confidence. Instead, it shows a preference for the most common objects
and predicates, as well as the prevalent patterns present in the dataset. Consequently, the model tends
to predict the stop token earlier than expected, resulting in predicted graph sequences that are likely to
include only the most common objects and predicates. As a result, the prediction sequences conclude
before reaching the size of the ground-truth labels (see Figure 7).
We believe that this issue does not seem to be specific to GraphART alone, but rather a common prob-
lem encountered by autoregressive models when dealing with label noise in the dataset. Interestingly,
we observed that the model’s predictions are occasionally more reasonable and logically consistent
than the ground-truth labels. For example, in Figure 7(a), the model predicts person - holding -
teddy bear, which is more accurate than person - attached to - teddy bear in the ground-
truth labels. In Figure 7(d), the model predicts window-other - in - building-other-merged,
which is missing from the ground-truth labels, and the labels have a lot of duplicated relations for
person. In Figure 7(e), the model predicts book - on - shelf, which is more accurate than book
- hanging from - wall-other-merged in the ground-truth labels. We believe that for future
research, it is necessary to develop scene graph datasets with cleaner labels in order to enable fair
benchmarking and evaluate the performance of different models more accurately.

B Additional Information For Experiment Setups


In this section, we provide additional details regarding the experiment setups, as well as the procedures for conducting experiments involving the models and tasks discussed in the paper, for full reproducibility of our results. Note that the amount of detail provided here is limited compared to the released code (https://anonymous.4open.science/r/ordering-in-graph-gen/).

B.1 Additional Information For Datasets

Scene Graph Generation: We use the OpenPSG dataset [3] which contains 49k annotated images,
and we follow the same train-test split as [3] , which has around 45k for training and 4k for testing.
Our problem setup for scene graph generation is slightly different from most of the existing works
on scene graph generation [2, 4, 3]. In particular, we consider semantic-level scene graphs where
every node represents a generic object and is not associated to a particular localized (or segmented)
object. For example, a “person1 - beside - person2” triplet will be converted into “person -
beside - person” in our problem setup. We chose this problem setup because our main focus is
to investigate the effect of different ordering methods on graph generation rather than striving for
state-of-the-art (SOTA) performances. As such, we avoid complex segmentation settings that could
potentially divert our focus. However, this doesn’t imply an easier task. In fact, our model must infer
the number of objects implicitly without resorting to object-level granularity. This contrasts with
methods like PSGTR and PSGFormer used in conventional scene graph generation tasks, where the
notion of objects is explicitly provided for training via segmentation masks.
We adopt the same data preprocessing pipelines used for training PSGTR and PSGFormer, except
that we do not use the panoptic segmentation labels for training. We also have a simpler cropping
function instead of the sophisticated multi-size cropping used in PSGTR and PSGFormer, and our
cropping generates smaller images that result in faster training. More specifically, we pad the input
images to 384 × 384, where the shorter side will be padded by 0 values to make the image squared
(equal aspect ratio). In contrast, PSGTR and PSGFormer use 640 × 384 or 1333 × 640. This
difference in input size may be a factor in why GraphART performs worse at identifying small objects.
Then, we parse the object-level scene graphs into semantic-level scene graphs involving 133 object
classes and 56 predicate classes. While state-of-the-art (SOTA) methods for this problem, such as
PSGTR [3] and PSGFormer [3], use additional panoptic segmentation labels for training, we do
not use this additional supervision in our framework. This is done to demonstrate the capability of
autoregressive models to directly generate scene graphs from images.
Note that the central goal of this paper is not to find a superior vision model that achieves state-of-the-art performance in scene graph generation. Thus, we did not fully optimize the pipeline, and there is still potential for GraphART to reach better performance on this particular task (scene graph generation). We welcome future work to explore the improvements that can be made with GraphART.
2 https://anonymous.4open.science/r/ordering-in-graph-gen/
Topological Graph Extraction: The task is based on the Toulouse Road Network dataset from [1],
which consists of 99k image-graph pairs. The dataset comprises 64 × 64 grayscale images paired
with corresponding topological graphs. We follow the same train-test split as described in [1], where
approximately 80k image-graph pairs are used for training, and 19k pairs are used for testing. The
dataset is derived from OpenStreetMap and involves the processing and labeling of raw road network
data from the city of Toulouse, France, which was obtained from GeoFabrik and timestamped in
June 2017. Each image represents a small cropped patch of the city map, and the coordinates are
normalized to [-1, 1]. Upon inspecting sample images from the dataset, we observed that the graph
structures in most patches are generally simple, reflecting the inherent simplicity of real-world road
networks.
To assess the models’ performance under more challenging conditions, we decided to create a syn-
thetic Planar Graph dataset from scratch. This dataset is generated by utilizing Delaunay triangulation
and the NetworkX package to generate random planar graphs. The node coordinates are sampled
from a uniform distribution U (0, 1)2 . Initially, the graphs contain 15 nodes. Preprocessing involves
collapsing nodes that are too close to each other, with a threshold of 0.1. Similarly, edges are collapsed
if they form an angle less than 30 degrees. After collapsing nodes and edges, self-loops are removed
to obtain the target graph. Subsequently, we utilize the OpenCV package to render the target graph
into 256 × 256 binary images, with edges represented by lines of random widths. To introduce some
variability, random artifacts are added to the images. The graph structures in this dataset are generally
more challenging when compared to the Road Network dataset. During training, minibatches of
images and ground-truth graphs are generated on the fly, resulting in a non-repeating dataset (of 40k
samples every epoch) and zero generalization gap in the trained model. During testing, we evaluate
performance on 1000 generated samples using 10 random seeds.
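For reference, a minimal sketch of this synthetic generation procedure is shown below; the node-merging step is a simplified greedy version, and the edge-angle collapsing (30-degree threshold) and image rendering steps are omitted.

import numpy as np
import networkx as nx
from scipy.spatial import Delaunay

def random_planar_graph(n_nodes=15, merge_thresh=0.1, seed=None):
    rng = np.random.default_rng(seed)
    pts = rng.uniform(0.0, 1.0, size=(n_nodes, 2))     # node coordinates ~ U(0, 1)^2
    tri = Delaunay(pts)
    g = nx.Graph()
    for simplex in tri.simplices:                       # add the Delaunay triangulation edges
        for a, b in [(0, 1), (1, 2), (0, 2)]:
            g.add_edge(int(simplex[a]), int(simplex[b]))
    # Greedily collapse node pairs that are closer than merge_thresh.
    for u in list(g.nodes):
        for v in list(g.nodes):
            if u != v and g.has_node(u) and g.has_node(v):
                if np.linalg.norm(pts[u] - pts[v]) < merge_thresh:
                    g = nx.contracted_nodes(g, u, v, self_loops=False)
    g.remove_edges_from(list(nx.selfloop_edges(g)))     # remove any remaining self-loops
    return g, pts

graph, coords = random_planar_graph(seed=0)
print(graph.number_of_nodes(), graph.number_of_edges())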
Circuit Graph Prediction: We use the Terahertz channelizer dataset3 consisting of 347k circuit
graphs. This dataset was introduced in the work by He et al. [5], which presented a novel approach
to represent circuits using graphs, where resonators are represented as nodes and electromagnetic
couplings as edges. The authors mapped the geometric configuration of a circuit to the graph structure,
using node attributes for resonator properties, while edge attributes captured the relative positions,
gap lengths, and shifts between resonators. The presence of edges in the graph is determined by a
distance threshold based on prior research. We refer readers to the original paper
for further details on the dataset, which includes visual examples of the circuits and their graph
representations.
In our paper, we consider the goal of solving the “inverse” problem, where we are expected to
predict the ground-truth circuit graph that consists of 3 to 6 resonators (or nodes), given desirable
electromagnetic (EM) properties (e.g., transfer function) as inputs. Each resonator (node) is rep-
resented by a 9-dimensional vector. We follow the problem statement of the forward and inverse
problems from the original work [5]. Since the scale of some dimensions in the resonator vectors is
much larger than that of others and may dominate the prediction errors, we normalize the data using
per-dimension min-max normalization. We report the errors on the normalized data.

B.2 Model Configurations And Baseline Methods

Baselines in Scene Graph Generation: For the scene graph generation task (Table 3), we consider
PSGTR [3] and PSGFormer [3] as state-of-the-art (SOTA) baselines. We ran the evaluation scripts
and trained checkpoints as provided by Yang et al. 4 to obtain the predictions on the test set, where each node in the predicted scene graph is grounded by its pixel-accurate segmentation mask in the image. Then, we convert the segmentation-mask-based scene graphs into semantic scene graphs. Since the output graph size of these methods depends on a confidence thresholding hyper-parameter, it is challenging to tune it to match the ground-truth graph sizes. Moreover, these methods are tuned to be biased towards achieving a good recall score, generally producing much larger graphs than the ground-truth (GT) labels. Therefore, we downsampled the output triplets to match the GT numbers and report their average performance over 10 random runs. Note that PSGTR and PSGFormer utilize panoptic segmentation labels as additional supervision during training, which we do not use in our framework. Additionally, we do not use the GT size of the scene graph as an input parameter in our approach, in contrast to the implementation of PSGTR and PSGFormer.
3 Please make sure to carefully review and comply with the license permissions and restrictions (if there are any) before using the dataset [5]: https://github.com/hehaodele/circuit-gnn
4 OpenPSG Github Repo: https://github.com/Jingkang50/OpenPSG
Baselines in Topological Graph Extraction: For the road network generation task (Table 1), we
use the GGT [1] and GraphRNN [10] algorithms as baselines. These models are supervised on
adjacency matrices using BCELoss, in contrast to our purely autoregressive framework that does not
generate graph-specific components. For the planar graph generation task (Table 2), we designed a
non-autoregressive architecture called GraphTR, inspired by the DETR architecture [30]. GraphTR
utilizes bipartite matching loss and does not rely on autoregressive generation.
Baselines in Circuit Graph Prediction: Regarding the circuit prediction task (Table 4), the
original work [5] does not directly solve the inverse problem from circuit specifications to circuit
graphs. Therefore, we implemented a model adapted from the original design using a graph attention
network (GAT) [42] and fed it with the ground-truth size of output graphs. We model each circuit as
a Graph Attention Network that is initialized to a projection of the input signals. We also initialize
the edge attributes to the electromagnetic coupling, which can be seen as a distance measure between
the nodes. Our task is to predict the nine “raw” node features for each node of the circuit graphs. We
focus on only predicting the node features as the original work [5] derives both node attributes and
edge attributes from these node features. Note that the CircuitGNN in the original work does not
have the capability to predict the target graph size (the number of resonators in the circuit) for a given
input; therefore, we train a separate model for each graph size (one model per size), as done in the
original work [5].
GraphART And GraphLSTM: We have different model configurations for each task included in
the paper. GraphLSTM uses the same configurations as GraphART for all the tasks, except that the
decoders are LSTM models instead of autoregressive transformers. Unless otherwise specified, the
dropout and drop-path probabilities are set to 0.

Figure 8: The encoder architecture of the GraphART and GraphLSTM for different tasks. (Diagram summary:)
• Topological graph extraction (rendered image of the target topological graph; planar graph: 3 x 256 x 256, road network: 3 x 64 x 64): EfficientNet-B0 (pretrained on ImageNet-1k), last-layer feature map spatially flattened, followed by a linear layer (1280 -> 512); the resulting feature vectors serve as input tokens with ViT positional embeddings.
• Scene graph generation (input image, 3 x 384 x 384): ResNet-50 (pretrained on ImageNet-1k, Dropout Prob. = 0.1), last-layer feature map spatially flattened, followed by a linear layer (2048 -> 1024); the resulting feature vectors serve as input tokens with ViT positional embeddings.
• Circuit graph prediction (complex-valued transfer function, 128 x 2, real and imaginary parts): MLP-2, ReLU activated (2 -> 256 -> 256); the outputs serve as input tokens.

The model configurations include the following details:


• Visual Encoders: Figure 8 presents the common encoder architecture for both GraphART
and GraphLSTM. These models utilize either an EfficientNet-B0 [43] or a ResNet-50 [44]
pretrained on the ImageNet-1k dataset [45] as the visual backbone network. The input
images are encoded into feature vectors after flattening the spatial dimensions of the output
feature map. The positional embeddings used in the original Vision Transformer (ViT) [46]
are adopted.
• Autoregressive Decoders: Figures 9 and 10 illustrate the detailed architectures for GraphART
and GraphLSTM, respectively. The GraphART model employs a GPT-2-based architecture
with stacked transformer decoder blocks, as depicted in Figure 3b. During training, teacher-
forcing is used by feeding the ground-truth sequence to the model. During inference, the
model performs iterative next-token prediction until a stop criterion is met. For road network
extraction, planar graph generation, and circuit graph prediction, stop tokens are represented
by vectors with all −1 values. If the mean absolute distance of a generated token to the vector
of all −1 values is less than 0.8, it is considered a stop token, leading to the termination of
iterative sequence generation (see the decoding sketch below).

Figure 9: The decoder architecture of the GraphART for different tasks. The autoregressive transformer (ART) is based on the GPT-2 architecture [17] with our modified decoder block architecture (see Figure 3b). (Diagram summary:)
• Road Network Extraction (ART-Small): Num. of Blocks = 6, Num. of Heads = 8, QKV Dim. = 384, FF-Net Dim. = 768; output head: LayerNorm (Dim. = 384) + MLP-2, GELU activated (384 -> 384 -> 4).
• Planar Graph Generation (ART-Medium): Num. of Blocks = 8, Num. of Heads = 8, QKV Dim. = 512, FF-Net Dim. = 1024; output head: LayerNorm (Dim. = 512) + MLP-2, GELU activated (512 -> 512 -> 4).
• Scene Graph Generation (ART-Large): Num. of Blocks = 8, Num. of Heads = 8, QKV Dim. = 768, FF-Net Dim. = 1536, Dropout Prob. = 0.1; output head: LayerNorm (Dim. = 768) + MLP-2, GELU activated (768 -> 768 -> 325).
• Circuit Graph Prediction (ART-Medium): Num. of Blocks = 8, Num. of Heads = 8, QKV Dim. = 512, FF-Net Dim. = 1024; output head: LayerNorm (Dim. = 512) + MLP-2, GELU activated (512 -> 512 -> 9).

Figure 10: The decoder architecture of the GraphLSTM for different tasks. (Diagram summary:)
• Road Network Extraction (LSTM-Small): torch.nn.LSTM, Num. of Blocks = 4, Hidden Dim. = 512; output head: LayerNorm (Dim. = 512) + MLP-2, GELU activated (512 -> 512 -> 4).
• Planar Graph Generation (LSTM-Small): torch.nn.LSTM, Num. of Blocks = 4, Hidden Dim. = 512; output head: LayerNorm (Dim. = 512) + MLP-2, GELU activated (512 -> 512 -> 4).
• Scene Graph Generation (LSTM-Large): torch.nn.LSTM, Num. of Blocks = 5, Hidden Dim. = 1024, Dropout Prob. = 0.1; output head: LayerNorm (Dim. = 1024) + MLP-2, GELU activated (1024 -> 1024 -> 325).
• Circuit Graph Prediction (LSTM-Medium): torch.nn.LSTM, Num. of Blocks = 5, Hidden Dim. = 768; output head: LayerNorm (Dim. = 768) + MLP-2, GELU activated (768 -> 768 -> 9).
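For reference, a minimal sketch of this inference-time stop criterion is shown below; `model.predict_next` is a hypothetical interface for a single autoregressive step, and the actual decoding loop in the released code may differ in detail.

import torch

def generate_tokens(model, cond, max_len=64, stop_thresh=0.8):
    tokens = []
    for _ in range(max_len):
        next_tok = model.predict_next(cond, tokens)      # hypothetical one-step interface
        # Stop when the token is close to the all -1 stop vector
        # (mean absolute distance below the 0.8 threshold).
        if (next_tok + 1.0).abs().mean() < stop_thresh:
            break
        tokens.append(next_tok)
    return torch.stack(tokens) if tokens else torch.empty(0)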

We aim to keep the number of parameters similar among the different models for the same task.
Please refer to Figures 8, 9 and 10 above for the details of the mentioned architectures.

B.3 Additional Information For Evaluation Metrics

To evaluate the performance of graph generation results in relation to the ground truth, it is appropriate
to employ metrics that quantify the dissimilarity between two sets. For the topological graph

generation tasks, we employ the StreetMover distance as was done in [1]. In addition, drawing
inspiration from the Earth Mover’s distance and Hausdorff distance, we introduce the Edge Mover
Distance (EMD) and Edge Hausdorff distance (EHD) metrics specifically tailored for the "edge-based
tokenization" scheme used in our models (refer to tokenization in Section 4.2).
StreetMover distance (SMD): The SMD is proposed by [1] as an evaluation metric for evaluating
generative models in the context of road networks. It addresses the limitations of existing metrics
by jointly capturing the accuracy of the reconstructed graph while being invariant to changes in
graph representation, transformations, and size. Unlike pixel-based metrics, SMD considers the
global alignment and magnitude of errors in the reconstructed graphs, making it more suitable for
evaluating road network generation. It overcomes the drawbacks of other metrics, such as Average
Path Length Similarity (APLS), which require post-processing steps and are designed for different
purposes. SMD is easily interpretable and computationally efficient, leveraging Sinkhorn iterations
for its computation.
Edge Mover Distance (EMD) And Edge Hausdorff Distance (EHD):
Given two finite-length sequences of N-dimensional points X and Y (where X and Y are matrices with
N columns), we denote by $\mathcal{X}$ and $\mathcal{Y}$ the corresponding unordered sets of X and Y. In other words,
$\mathcal{X}$ and $\mathcal{Y}$ are sets that contain the rows of the corresponding matrices, where each row represents an
N-dimensional point.
The EMD between two sequences X and Y (represented by matrices) can be defined using their
unordered sets $\mathcal{X}$ and $\mathcal{Y}$:
\[
\mathrm{EMD}(\mathcal{X}, \mathcal{Y}) = \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} \rho(x, y), \tag{7}
\]
and the EHD can be defined as:
\[
\mathrm{EHD}(\mathcal{X}, \mathcal{Y}) = \max\Big( \sup_{x \in \mathcal{X}} \inf_{y \in \mathcal{Y}} \rho(x, y),\; \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} \rho(x, y) \Big), \tag{8}
\]

where |X | is the cardinality of set X representing the number of points in X . The function ρ(x, y)
represents the distance between points x and y. This distance can be calculated using various metrics,
such as the Euclidean distance or any other suitable distance metric for the given problem domain.
The first term supx∈X inf y∈Y ρ(x, y) computes the maximum distance from any point in X to
its nearest neighbor in Y, while the second term supy∈Y inf x∈X ρ(x, y) computes the maximum
distance from any point in Y to its nearest neighbor in X . The EMD calculates the average of the
minimum distances between each point in set X and its nearest neighbor in set Y, whereas the EHD
measures the maximum distance between the closest points of two sets X and Y. The EHD is the
larger of these two values, representing the maximum discrepancy between the two sets.
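For reference, a minimal sketch of the two metrics in Equations (7) and (8) is shown below, assuming Euclidean distance for rho and NumPy arrays of tokens.

import numpy as np

def pairwise_dist(X, Y):
    # rho(x, y) as the Euclidean distance between all pairs of rows.
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

def edge_mover_distance(X, Y):
    D = pairwise_dist(X, Y)
    return D.min(axis=1).mean()          # Eq. (7): average nearest-neighbor distance from X to Y

def edge_hausdorff_distance(X, Y):
    D = pairwise_dist(X, Y)
    return max(D.min(axis=1).max(),      # sup_x inf_y rho(x, y)
               D.min(axis=0).max())      # sup_y inf_x rho(x, y)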
Classification Metrics For Sequences And Sets of Different Sizes:
In traditional classification settings, precision, recall, and F1 score are typically computed on a per-
instance basis. Each individual item (or instance) in the prediction set is compared to its corresponding
item in the true set, and the counts of true positives, false positives, and false negatives are aggregated
over all instances, often using "macro" averaging. This approach works well when the prediction
set and true set have an equal number of instances. However, in situations where the prediction set
and true set have a different number of instances, the traditional per-instance calculation becomes
inapplicable. It is crucial to establish new definitions for these metrics that can appropriately handle
the discrepancy in set sizes.
We use the same definitions as above: $\mathcal{X}$ and $\mathcal{Y}$ are the unordered sets corresponding to the predicted
and ground-truth sequences X and Y. To calculate TP (True Positives), we compute the cardinality of
the intersection of $\mathcal{X}$ and $\mathcal{Y}$, which represents the count of matches or true positives. Specifically, we
express TP, FP, and FN as follows:
\[
TP = |\mathcal{X} \cap \mathcal{Y}|, \qquad FP = |\mathcal{X}| - TP, \qquad FN = |\mathcal{Y}| - TP,
\]
where | · | represents the cardinality of a set. Using this new definition of TP, FP and FN, we define
Precision, Recall, and F1 Score between two sets of different sizes (cardinality) as follows:

Precision is the proportion of correct predictions (TP) to the total number of predictions in the
predicted sets. It reflects the percentage of predictions that are correct. Mathematically, Precision is
defined as:
\[
Precision = \frac{TP}{TP + FP} = \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{X}|} \tag{9}
\]
Recall is the proportion of correct predictions (TP) to the total number of elements in the target set. It
reflects the percentage of ground-truth elements that are included in the predictions. Mathematically,
Recall is defined as:
\[
Recall = \frac{TP}{TP + FN} = \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{Y}|} \tag{10}
\]
Similar to the traditional definition, the F1 Score for sets is the harmonic mean of Precision and
Recall. The F1 Score gives equal weight to both Precision and Recall and provides a balanced
measure of model performance. High Precision and high Recall will yield a high F1 Score. The F1
Score is defined as:
\[
F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
    = \frac{2 \times \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{X}|} \times \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{Y}|}}{\frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{X}|} + \frac{|\mathcal{X} \cap \mathcal{Y}|}{|\mathcal{Y}|}} \tag{11}
\]

Using the above definitions, there are precision, recall and F1 scores for each pair of predicted and
ground-truth graphs. We report the averaged values over the entire test set.
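For reference, a minimal sketch of these set-based metrics is shown below; rounding continuous tokens so that they become hashable set elements is an implementation assumption.

def set_precision_recall_f1(pred_tokens, true_tokens, decimals=4):
    # Hash each token to a tuple so that the set intersection in Eqs. (9)-(11) is well defined.
    X = {tuple(round(float(v), decimals) for v in tok) for tok in pred_tokens}
    Y = {tuple(round(float(v), decimals) for v in tok) for tok in true_tokens}
    tp = len(X & Y)
    precision = tp / len(X) if X else 0.0
    recall = tp / len(Y) if Y else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1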

B.4 Training Setups For Graph Generation Models

The configurations for the graph generation models are described as follows:

• Planar Graph (Table 2) & Road Network (Table 1): We train the models using the Adam
(planar graph) or AdamW (road network) optimizer with the following settings: learning
rate = 1.0 × 10−4 , weight decay = 0 (planar graph) or 1.0 × 10−4 (road network), and an
effective batch size = 300 (150 on 2 GPUs). The warm-up period is set to 10% of the total
epochs, and we employ "cosine" annealing. Sequences are padded to the maximum length
of sequences plus one within a batch, using a padding value of -1. The stop token is also
represented by the value -1. Training 1000 epochs on the planar graph dataset typically
takes approximately 16 to 20 hours on two NVIDIA A100 GPUs with 128 CPU cores, using
32-bit precision5 , and it takes around 6 to 8 hours to train for 200 epochs on the road network
dataset using a single NVIDIA A100 GPU with 16 CPU cores.
• Scene Graph (Table 3): We train the models for 20 epochs using the AdamW optimizer
with the following settings: learning rate = 1.0 × 10−4 , weight decay = 5.0 × 10−4 , and an
effective batch size of 100. The warm-up period is set to 10% of the total epochs, and we
employ "cosine" annealing. Sequences are padded within a batch to the maximum length of
sequences plus one, with the addition of stop tokens. For each of the two objects and the predicate
in a triplet, which are one-hot encoded, an additional dimension is added for the "stop token" class,
resulting in a total of three additional dimensions. Training typically takes approximately 2 hours on
a single NVIDIA A100 GPU with 128 CPU cores, using 32-bit precision.
• Circuit Graph (Table 4): We train the models using the AdamW optimizer with the
following settings: learning rate = 1.0 × 10−4 , weight decay = 1.0 × 10−4 , and an effective
batch size = 300 (150 on 2 GPUs). The warm-up period is set to 10% of the total epochs, and
we employ "cosine" annealing. Sequences are padded to the maximum length of sequences
plus one within a batch, using a padding value of -1. The stop token is also represented
by the value -1. Training 200 epochs typically takes approximately 20 to 24 hours on two
NVIDIA A100 GPUs with 64 CPU cores, using 32-bit precision.
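For reference, a minimal sketch of this padding convention (pad value -1, maximum batch sequence length plus one) is shown below.

import torch

def collate_graph_sequences(seqs, pad_value=-1.0):
    # seqs: list of (L_i, N) float tensors of graph tokens.
    max_len = max(s.shape[0] for s in seqs) + 1        # +1 leaves room for the stop token
    dim = seqs[0].shape[1]
    batch = torch.full((len(seqs), max_len, dim), pad_value)
    for i, s in enumerate(seqs):
        batch[i, : s.shape[0]] = s                     # remaining positions stay at -1
    return batch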

B.5 Training Latent Sort Encoders

We trained 10 latent sort encoders for each experiment and selected the one that yielded the best
validation performance for the paper. We observed that the difference in graph generation performance
due to different random initializations of the latent sort encoders is less significant than the impact of
the hyperparameters used in training the encoders.
5 We observed degradation in model accuracy when using 16-bit precision in training.
Based on our observations, here are some key takeaways to guide the design of latent sort encoders:

• We found that different random initializations of the latent sort encoders can lead to less
than a 5% difference in graph generation performance (measured in terms of EMD values
using GraphART).
• Larger latent sort encoders do not necessarily result in better graph generation performance,
even if they have lower reconstruction errors. In our experiments, we utilized multilayer
perceptrons as encoders and decoders, each consisting of three hidden layers with 512
neurons and tanh activations (a training sketch is provided after this list).
• The most important hyperparameters we identified include: the coefficient of LGP in the
loss function, the strength of L2 regularization, and the number of training epochs. The
trade-off parameter controlling the strength of LGP regularization is the most important
hyperparameter, and we used a range of 0.01 to 0.1 in our experiments. We found that L2
regularization was unnecessary (thus set to 0) when using LGP. Generally, longer training
epochs do not lead to a degradation in graph generation performance.
• Learning rate scheduling plays a crucial role in the performance of latent sort encoders.
In our experiments, we employ Adam optimizers and cosine learning rate annealing. The
model undergoes a warm-up phase at the beginning of training, lasting 10% of the total
training epochs. During this phase, the learning rate is exponentially increased from an
initial value of 1.0 × 10−5 . Subsequently, the learning rate gradually decreases until it
reaches a final value of 1.0 × 10−8 .
• We observe that using the undirected graph loss for reconstruction does not produce significant
differences compared to the L1 reconstruction loss for the topological graph generation
tasks. For circuit graph generation, we use the L1 loss for reconstruction. For the scene graph
generation task, we use the binary cross entropy (BCE) loss, as used in training the graph
generation models.
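As a reference, here is a minimal sketch, under the configuration described above (three hidden layers of 512 units with tanh activations), of training a latent sort autoencoder with an L1 reconstruction loss plus a weighted LGP term; `lgp_loss` is only a placeholder for the LGP regularizer defined in the main paper, and the learning-rate warm-up and cosine annealing described above are omitted for brevity.

import torch
import torch.nn as nn

def mlp(dims):
    # Tanh-activated MLP; no activation on the final output layer.
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.Tanh()]
    return nn.Sequential(*layers[:-1])

def lgp_loss(h, x):
    # Placeholder for the LGP regularizer from the main paper (returns zero here).
    return torch.zeros((), device=h.device)

def train_latent_sort(tokens, lgp_weight=0.05, epochs=100, lr=1e-3):
    # tokens: (num_tokens, N) float tensor of graph tokens pooled from the training set.
    n = tokens.shape[1]
    enc = mlp([n, 512, 512, 512, 1])     # encoder f_e: R^N -> R
    dec = mlp([1, 512, 512, 512, n])     # decoder f_d: R -> R^N
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        h = enc(tokens)                  # 1-D latent codes
        recon = dec(h)
        loss = (recon - tokens).abs().mean() + lgp_weight * lgp_loss(h, tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc, dec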

C Supplementary Theoretical Results And Discussions


This section supplements the theoretical results presented in Section 3 by providing a more detailed
and comprehensive explanation of the overall framework. The structure of this section is as follows:

C.1 includes the definition and motivations of sorting algorithms from the perspective of dimen-
sionality reduction (DR), along with other important notations. It also includes a discussion
of how sequence orderings affect model behavior.
C.2 introduces the definition of ordering ambiguity for sorting algorithms.
C.3 discusses the expected prediction errors caused by ordering ambiguity.
C.4 introduces the idea of the latent sort algorithm. It also includes two theorems to reveal some
of its properties.
C.5 discusses prediction errors from imperfect reconstruction in the latent sort algorithm.
First, it connects the reconstruction error (assuming a normal distribution) with the error
distribution of the 1-D latent values. Theorem C.3 shows how to estimate the 1-D latent
distribution (in terms of mean and variance) from the distribution of reconstruction errors.
Then, it discusses the ordering errors that may be caused by imperfect reconstruction.
C.6 provides a way to simplify the ordering errors from imperfect reconstruction.
C.7 discusses the desirable properties of an optimal ordering to minimize the errors from
imperfect reconstruction, using the simplified expression of errors from C.6.

C.1 Sorting As A Dimensionality Reduction Problem

The Ordering Problem in Set Generation: Since we tokenize each graph into a set of N -dimensional
tokens (N > 1), we denote a graph as a set X consisting of M unique points (we also refer to these
as vector tokens) within an N -dimensional space. These points are unordered, which means they
don’t follow a specific sequence or pattern.

Our goal is to discover the best possible (or optimal) ordering for these points, which when used as
the target sequence to supervise an autoregressive model yields optimal performance in generating X .
We denote the sequence obtained by applying the optimal ordering as Y ∗ . To visualize this, imagine
Y ∗ as a matrix where every row represents a point in X . Each point is assigned an index which
signifies its position in this optimal order.
Finding Y ∗ is very challenging since we do not know what orderings the autoregressive models
“prefer” that yield optimal results. In fact, different model architectures or different applications
can have different optimal orderings. For example, in an RNN, each output at time t depends on
the previous outputs and the current input, forming a sequence of dependencies (i.e., a chain-like
structure). This means information has to flow sequentially from one step to the next. In contrast,
the transformer model’s attention mechanism allows it to have direct dependencies between all pairs
of positions in the sequence. Therefore, information can flow directly between any two positions
in a sequence, which allows transformers to capture longer dependencies more efficiently. As a
result, the optimal ordering of input elements for an RNN may show temporal or spatial continuity,
given its inherent sequential processing nature, while a Transformer may not strictly require such
continuity due to its capability to process all elements concurrently and directly. Moreover, the type
of application also plays a vital role in determining the optimal ordering. Therefore, our goal isn’t
simply to find an ordering that fits all models, but rather to discover an ordering that is best suited to
the model’s architecture and the specific use-case it is being applied to. The optimal orderings could
also potentially be influenced by factors like the data distribution, the complexity of the learning task,
and even the particular optimization algorithms used during the training process.
How About End-to-End Learning Based Ordering?
One might naturally consider employing an end-to-end learning approach, in which the process of
identifying an optimal ordering is integrated directly into the training of the model. This approach
could, for example, involve a Transformer model, known for its permutation invariance at the encoder
stage when no positional embeddings are used. However, at the decoder stage, it could be designed
to attend to a learnable positional embedding, essentially “learning” the optimal sequence ordering as
part of the training process. This can be an appealing idea since the model is potentially capable of
dynamically adapting to the best ordering for the given data and task.
However, there’s a significant challenge when employing such an end-to-end training scheme. Neural
networks, including both the ordering model and the autoregressive model, tend to have a critical
learning phase [47] in the early epochs, where they learn essential features and structures of the data.
In an end-to-end setup, the ordering model is consistently changing the sequence ordering to try to
best suit the autoregressive model, while the autoregressive model is simultaneously trying to adapt
to the new ordering provided by the ordering model. This dual adaptation can create issues with
convergence, since we cannot guarantee the two models will ultimately "agree" with each other. This
can result in a chaotic learning process where the two models are continuously trying to adapt to each
other’s changes without reaching a stable state. Indeed, in our experiments, we have observed that
such end-to-end learning systems often fail to converge, which motivated us to discontinue them in
our work.
Why We Opt For Dimensionality Reduction (DR)?
A natural question to ask is, how is sorting related to Dimension Reduction (DR)? We argue that,
although sorting in a differentiable manner is challenging, finding differentiable methods for DR is
more straightforward. This offers new opportunities for creating learning-based methods to order
points in high-dimensional spaces.
In a simpler scenario, let’s say we are working with 1-D representations of all points in X , which
we denote as H, and have normalized them to a range between 0 and 1. This allows us to define the
problem of finding the optimal ordering as discovering the optimal DR mapping.
Consider a sorting algorithm, s, which takes X and transforms it into Y , the ordered sequence of
the points. We can represent this sorting algorithm with a DR mapping, f , which takes points in the
original high-dimensional space xi and reduces it to their 1-D value, hi .
Note that this DR mapping, f , assigns each token xi a specific position, hi , in the latent space,
regardless of the combinations of tokens it appears with in X . Therefore, the task of ordering any

arbitrary combination of tokens in X essentially boils down to finding a sorted path traversal in their
1-D latent space representations.
Why Local DR Instead of Non-Local DR?
We broadly classify DR methods that can be used for sorting into two categories: local and non-
local. Local DR methods compute the DR mapping based solely on the value of an individual
token, independent of other tokens in the set. In contrast, non-local DR methods involve the
interaction between different tokens in the set. Non-local methods such as Convolutional Neural
Networks (CNNs) or autoregressive Transformer models inherently assume some form of structure
or relationships among tokens when performing DR.
For the problem at hand, we do not have any prior knowledge about the best ordering of tokens.
Therefore, assuming any form of relationship among tokens, as non-local DR methods do, might
not be appropriate. For example, in the case of 1D-CNNs, the 1-D representation depends on the
arrangement of neighboring tokens. Since we do not know the correct ordering, deciding what tokens
should be considered neighbors can be ambiguous and arbitrary. This ambiguity will then be reflected
in the 1-D representations, leading to potential inconsistencies and inaccuracies. Similarly, using
autoregressive Transformers for DR involves an even greater level of complexity since every token can
potentially influence every other token. However, unlike standard Transformers, the autoregressive
nature of these models limits the flow of information to a directional sequence, which complicates
the global interaction of tokens when the correct ordering is unknown. This suggests an implicit
assumption of a global sequence structure among all tokens in the set. Yet, without knowing the
correct ordering of tokens, determining this structure becomes a chicken-and-egg problem.
There is one notable exception to the aforementioned issue with non-local DR methods: the self-
attention mechanism of Transformers without using positional encoding. It treats the token set as a
whole, without assuming any specific order or structure, thanks to its full-attention property. This
full-attention mechanism, which allows the model to attend to all tokens in the set, is the key to the
Transformer’s permutation invariance. However, the full-attention mechanism has one significant
drawback when used in the context of autoregressive models. The full-attention mechanism is
inherently bidirectional, meaning that it allows for information flow from both "left" and "right"
tokens, effectively modeling P (xi |x1 , x2 , ..., xM ). This is contrary to the autoregressive models that
we use for graph generation, which only attend to "left" side tokens and learn P (xi |x1 , x2 , ..., xi−1 )
[19]. The bidirectional nature of the full-attention mechanism means that it has access to future
information that is unavailable to the autoregressive models. Consequently, the autoregressive models
cannot learn the ordering rule represented by the full-attention Transformer, as they are inherently
unable to incorporate “right" side information. This discrepancy between the two types of models
poses a significant challenge in aligning the orderings learned by them, and thus limits the utility of
using a full-attention Transformer for ordering in the context of training autoregressive models.
In contrast to all the non-local DR methods mentioned above, local DR methods like MLPs do not
require any assumptions about relationships among tokens. The 1-D representations are computed
based solely on the properties of individual tokens, which are the only certain entities in our context.
Therefore, in our scenario where the correct ordering of tokens is unknown, local DR proves to be a
more reliable and less assumptive approach for obtaining 1-D representations of tokens.
Restated Definition of the Ordering Problem: Given a set of M unordered points (or vector
tokens) X ⊆ RN , |X | = M in an N -dimensional space (N > 1), we are interested in finding the
“optimal” ordering of points in X , denoted by Y ∗ ∈ RM ×N , which when used as the target sequence
to supervise an autoregressive model yields optimal performance in generating X . Formally, let us
denote the ordered sequence Y ∗ as the matrix [x∗1 , x∗2 , ..., x∗M ] , where every row x∗i of Y ∗ is a

point in X and i denotes its sorted index in the optimal ordering. For now, we assume that such an
ordering exists for every set X , and we will discuss the desired property of it in Section C.7. For
simplicity, we assume the 1-D representations of all the points in X , denoted by H, are normalized to
[0, 1].

Definition 3.1 (DR-based Sorting Algorithm). Given a sorting algorithm s : X → Y, let Y =
[x1, x2, ..., xM]⊺ be the resulting ordered sequence for the unordered set X ⊆ RN. We can then
represent the sorting algorithm s by a DR mapping f : RN → R, such that the ordering in the original
N-D space is given by sorting their 1-D values, hi = f(xi) ∈ R, and ∀i ≤ j ≤ M, f(xi) ≤ f(xj).

We can thus formulate the problem of finding the optimal sorting s∗ : X → Y ∗ as finding an optimal
DR mapping f ∗ : RN → R, h∗i = f ∗ (x∗i ), and ∀i ≤ j ≤ M, f ∗ (x∗i ) ≤ f ∗ (x∗j ).
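For concreteness, the following is a minimal sketch of how a DR-based sorting algorithm orders a token set in practice; `encoder` stands for any learned mapping from R^N to R (for example, the latent sort encoder of Appendix B.5).

import torch

def latent_sort(tokens, encoder):
    # tokens: (M, N) tensor of unordered graph tokens;
    # encoder: a learned DR mapping f_e producing one scalar per token.
    with torch.no_grad():
        h = encoder(tokens).squeeze(-1)   # 1-D latent values h_i = f_e(x_i)
    order = torch.argsort(h)              # ascending latent values define the ordering
    return tokens[order]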
Modeling the Errors of a Sorting Algorithm: Assuming we know the target sequence Y ∗ and
we only want to measure the alignment between a sequence Y and Y ∗ , we introduce a probability
matrix P ∈ RM ×M , where each entry pij within P is designed to represent the probability of xi
(i-th element in Y ) being x∗j (j-th element in Y ∗ ). In this way, the expected value of the ordering Y ,
denoted by E[Y ], can be computed as P Y ∗ . The matrix P serves as a “soft" permutation matrix that
transforms Y ∗ into Y . This error representation is generic for analyzing different types of errors that
may emerge in a sorting algorithm.
For simplicity, we use the Frobenius norm along with the P matrix representation:
\[
E\big(\mathbb{E}[Y], Y^*\big) = \big\|\mathbb{E}[Y] - Y^*\big\|_F^2 = \big\|P Y^* - Y^*\big\|_F^2, \tag{12}
\]
where ∥ · ∥F denotes the Frobenius norm. We show that the minimization of $E(\mathbb{E}[Y], Y^*)$ can be
solved by an equivalent problem of minimizing $\|P - I_M\|_F^2$ (see Lemma D.1).

C.2 Definition of The Ordering Ambiguity

We refer to the problem where multiple points in X are assigned the same 1-D representation value
as ordering ambiguity. It is problematic since the ordering of tokens with the same 1-D latent value
becomes undefined if no further tie-breaking schemes are defined.
Why Ordering Ambiguity Is Almost Inevitable for Local DR? As we discussed earlier, local DR
methods are more suitable for sorting tokens used for training the autoregressive models. Local DR
methods are characterized by the computation of dimensionality reduction solely based on the value
of an individual token, without considering the context or interactions among other tokens within the
set. Since high-to-low-dimensional mappings are mostly surjective (each element in the target space
has a pre-image in the domain) and not bijective (one-to-one correspondence), it is very likely that
multiple N-D points share the same 1-D representation value (see Remark E.3). This property is a
fundamental source of ordering ambiguity, making it an almost inevitable occurrence in local DR
methods. Non-local DR methods, on the other hand, can dynamically adapt to alterations in other
data points, thereby circumventing collisions in the low-dimensional space. This adaptability can
potentially prevent the occurrence of ordering ambiguity.
The ordering ambiguity of a sorting algorithm can be characterized by an ordering ambiguity set,
which is defined as follows:
Definition 3.2 (Ordering Ambiguity Set). Let Ai be the ordering ambiguity set for a point xi ∈ X ,
where Ai = {xj |∀xj ∈ X , f (xi ) = f (xj )}.

C.3 The Errors From Ordering Ambiguity

In the presence of ordering ambiguity, the resulting sorted sequence Y may not be an accurate
representation of the target sorted sequence Y ∗ . We can use the above P matrix to quantify the error
introduced by ordering ambiguity.
Assuming that Y = E[Y] = Y∗ in the absence of ordering ambiguity, we assume the autoregressive
model will converge to the expectation of the ambiguity set Ai for each point xi in Ai, i.e.,
\[
x_i = \mathbb{E}_{x_j \in \mathcal{A}_i}[x_j]. \tag{13}
\]
When ordering ambiguity is present, we assume the probabilities are spread uniformly over the
members of the corresponding ambiguity set Ai, i.e.,
\[
p_{ij} = \begin{cases} \frac{1}{|\mathcal{A}_i|} & \text{if } x_j^* \in \mathcal{A}_i \\ 0 & \text{otherwise} \end{cases} \tag{14}
\]

Therefore we have:
\[
\mathbb{E}[Y] - Y^* = P Y^* - Y^* =
\begin{bmatrix}
\frac{1}{|\mathcal{A}_1|} \sum_{x_j^* \in \mathcal{A}_1} x_j^{*\top} - x_1^{*\top} \\
\frac{1}{|\mathcal{A}_2|} \sum_{x_j^* \in \mathcal{A}_2} x_j^{*\top} - x_2^{*\top} \\
\vdots \\
\frac{1}{|\mathcal{A}_M|} \sum_{x_j^* \in \mathcal{A}_M} x_j^{*\top} - x_M^{*\top}
\end{bmatrix} \tag{15}
\]
Now we can calculate the error $E(\mathbb{E}[Y], Y^*)$ by taking the squared Frobenius norm of the difference
between P Y∗ and Y∗:
\[
E\big(\mathbb{E}[Y], Y^*\big) = \|P Y^* - Y^*\|_F^2 \tag{16}
\]
\[
= \sum_{i=1}^{M} \Big\| \frac{1}{|\mathcal{A}_i|} \sum_{x_j^* \in \mathcal{A}_i} x_j^* - x_i^* \Big\|^2 \tag{17}
\]
\[
= \sum_{i=1}^{M} \Big\| \Big( \frac{1}{|\mathcal{A}_i|} \sum_{x_j \in \mathcal{A}_i} x_j \Big) - x_i \Big\|^2 \tag{18}
\]

Here, ∥ · ∥F denotes the Frobenius norm, which is the square root of the sum of the squared elements
of a matrix or vector. Note that we leverage the summation in the formula to make the error term
independent from the definition of the target sequence Y ∗ .
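For illustration, the following is a minimal sketch of evaluating the ordering-ambiguity error in Equation (18) from a set of tokens and their 1-D latent values; grouping tokens whose latent values agree up to rounding into ambiguity sets is an implementation assumption.

import numpy as np

def ambiguity_error(tokens, latents, decimals=6):
    # tokens: (M, N) array; latents: (M,) array of 1-D values h_i = f(x_i).
    keys = np.round(latents, decimals)     # tokens sharing a latent value form an ambiguity set
    err = 0.0
    for k in np.unique(keys):
        group = tokens[keys == k]          # ambiguity set A_i
        # squared distance of each member to the mean of its ambiguity set, as in Eq. (18)
        err += ((group.mean(axis=0) - group) ** 2).sum()
    return err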
The Impact of Ordering Ambiguity on Autoregressive Models: The ordering of tokens within
an ordering ambiguity set is determined using a uniform distribution in the above analysis, which
might seem confusing, given that the DR methods we employ are deterministic and, theoretically,
should always yield a consistent order rather than a “random" one. However, the randomness can be
understood from the perspective of the training data input into the autoregressive models.
In training autoregressive models, our objective is to enable the model to learn generalizable rules
automatically from the training data, including those determining the ordering of tokens. Suppose
the ordering varies dramatically among different sequences in the training data with similar patterns.
In that case, the model may struggle to identify the factors causing these discrepancies in ordering.
Instead, it may resort to learning an average of possible orderings (see Figure 11), a phenomenon that
is commonly observed when the models have difficulty fitting the data.
The issue of ordering ambiguity compounds this problem. Without a consistent and reliable ordering
rule for tokens within the ambiguity set, different samples in the dataset might exhibit varied ordering.
Such inconsistency makes it challenging for autoregressive models to effectively learn the underlying
ordering rule. As such, the presence of ordering ambiguity in the training data can significantly
hamper the performance of autoregressive models in terms of sequence prediction.

Figure 11: Illustration of the ordering ambiguity problem in autoregressive graph generation. (Diagram summary: a ground-truth graph with nodes A, B, C, D is tokenized into a target sequence; the orderings A B C D and A C B D each occur with probability 0.5. Supervised with this ambiguous ordering via the loss function, the autoregressive graph generator converges after training to a predicted sequence A X X D with X = 0.5 B + 0.5 C, so the constructed predicted graph contains the averaged node X in place of B and C.)

C.4 The Latent Sort Algorithm

The main idea of the latent sort algorithm is to use an auto-encoder to learn a dimensionality reduction
mapping, which enjoys a high degree of flexibility when paired with different loss functions to achieve
different goals. In our design, the auto-encoder consists of two multi-layer perceptrons (MLPs) as the

encoder and decoder, respectively, denoted as fe and fd. We can use the encoder for sorting xi, and
the ordering is given by the encoded 1-D latent representation hi = fe(xi).
The Ordering Ambiguity of Latent Sort: Since the auto-encoder uses deterministic neural networks
(e.g., MLPs), the encoder and decoder are both surjective (see Remark E.4). The auto-encoder uses
reconstruction loss during training. If the auto-encoder had perfect reconstruction, its encoder and
decoder would both be bijective (see Remark E.5) and there would be no ordering ambiguity. However, we
know this cannot be true in most cases (see Remark E.3). Therefore, when there exists no bijective
mapping between X and H, ordering ambiguity is inevitable in latent sort. When multiple points in
X are mapped to the same point in H, we can use Equation (2) to estimate the error due to ordering
ambiguity.
Some Theoretical Properties of The Latent Sort Autoencoders:
For latent sort, since we adopt an auto-encoder for dimensionality reduction, additional theoretical
properties follow from the fact that neural networks are often Lipschitz continuous.
Assume that the auto-encoder is initialized using a random distribution with a small variance, and
the model is trained with proper regularization, such that after training, the encoder and decoder
are Lipschitz continuous with constants Ke and Kd respectively, and Ke Kd ≥ 1. Assume for all
xi ∈ X and its reconstructed x̂i = fd (fe (xi )), ∥xi − x̂i ∥ has a constant upper bound B. Assuming
∀xi ∈ X, there always exists at least one point h∗i ∈ R such that fd(h∗i) = xi, we have the following
theorems on the relationship between the latent representation error and the reconstruction error:

Theorem C.1 (Bounded Original Distance). Let xi and xj be two distinct points in X ⊆ RN , and
let hi = fe (xi ) and hj = fe (xj ) be their corresponding latent representations, and x̂i = fd (hi ) and
xˆj = fd (hj ) be their corresponding reconstructions. If |hi − hj | ≤ ϵ, where constant ϵ > 0, then the
distance ∥xi − xj ∥ has an upper bound 2B + Kd ϵ.

Proof. From the assumptions, we know that the decoder fd is Kd-Lipschitz continuous. Let fd(h∗i) =
xi and fd(h∗j) = xj, where h∗i and h∗j are the two 1-D representation values that could result in
perfect reconstruction. Since the reconstruction error has a constant upper bound B for all points in
X, we get
\[
\|x_i - \hat{x}_i\| = \|x_i - f_d(h_i)\| = \|f_d(h_i^*) - f_d(h_i)\| \le K_d\,|h_i^* - h_i| \le B,
\]
\[
\|x_j - \hat{x}_j\| = \|x_j - f_d(h_j)\| = \|f_d(h_j^*) - f_d(h_j)\| \le K_d\,|h_j^* - h_j| \le B.
\]
Using |hi − hj| ≤ ϵ and the triangle inequality we have:
\[
\begin{aligned}
\|x_i - x_j\| &= \|f_d(h_i^*) - f_d(h_j^*)\| \le K_d\,|h_i^* - h_j^*| \\
&= K_d\,\big(|h_i^* - h_i + h_i - h_j + h_j - h_j^*|\big) \\
&\le K_d\,\big(|h_i^* - h_i| + |h_i - h_j| + |h_j - h_j^*|\big) \\
&\le K_d \Big( \tfrac{1}{K_d}\,\|f_d(h_i^*) - f_d(h_i)\| + \epsilon + \tfrac{1}{K_d}\,\|f_d(h_j^*) - f_d(h_j)\| \Big) \\
&= \|f_d(h_i^*) - f_d(h_i)\| + K_d\,\epsilon + \|f_d(h_j^*) - f_d(h_j)\| \\
&\le 2B + K_d\,\epsilon,
\end{aligned}
\]
which gives the upper bound of ∥xi − xj∥.
In addition, we need to examine whether the upper bound exists. Since the encoder is Ke-Lipschitz
continuous everywhere on X and we have assumed KeKd ≥ 1, we have
\[
|h_i - h_j| \le K_e \|x_i - x_j\| \tag{19}
\]
\[
\le K_e (2B + K_d \epsilon) \tag{20}
\]
\[
= 2 K_e B + K_e K_d \epsilon \tag{21}
\]
Since |hi − hj| ≤ ϵ and 2KeB ≥ 0, inequality (21) is always true for any xi and xj in X. Therefore,
the upper bound exists as stated in the theorem.

Theorem C.1 establishes a relationship between the distance in the original space and the distance
in the latent representation space learned by an auto-encoder. It states that if two data points have
similar latent representations, their distance in the original space is also small, with an upper bound
proportional to the Lipschitz constants of the encoder and decoder.

Theorem C.2 (Bounded Latent Distance). Let xi and xj be two distinct points in X ⊆ RN , and let
hi = fe (xi ) and hj = fe (xj ) be their corresponding latent representations, and x̂i = fd (hi ) and
xˆj = fd (hj ) be their corresponding reconstructions. If the distance ∥xi − xj ∥ is greater than or
equal to a constant D, and D ≥ 2B, then the difference in their latent representations |hi − hj | is
larger than a constant lower bound (D − 2B)/Kd .

Proof. Let fd(h∗i) = xi and fd(h∗j) = xj. Since fd is Lipschitz continuous with constant Kd, we
know that
\[
\|x_i - \hat{x}_i\| \le K_d\,|h_i^* - h_i| \le B, \qquad
\|x_j - \hat{x}_j\| \le K_d\,|h_j^* - h_j| \le B, \qquad
D \le \|x_i - x_j\| \le K_d\,|h_i^* - h_j^*|.
\]
From the assumptions, we know that the distance ∥xi − xj∥ ≥ D and D ≥ 2B. Using the reverse
triangle inequality, we get
\[
\begin{aligned}
|h_i - h_j| &= \big|(h_i^* - h_j^*) - (h_i^* - h_i) + (h_j^* - h_j)\big| & (22) \\
&\ge |h_i^* - h_j^*| - \big|(h_i^* - h_i) - (h_j^* - h_j)\big| & (23) \\
&\ge |h_i^* - h_j^*| - |h_i^* - h_i| - |h_j^* - h_j| & (24) \\
&\ge \frac{1}{K_d} D - B - B & (25) \\
&\ge \frac{1}{K_d} (D - 2B) & (26)
\end{aligned}
\]
Thus we obtain the lower bound for |hi − hj|.
In addition, we need to examine whether the lower bound exists. Since fe is Ke-Lipschitz continuous
and KeKd ≥ 1, we have
\[
|h_i - h_j| \le K_e \|x_i - x_j\| \tag{27}
\]
\[
\frac{1}{K_d}(D - 2B) \le K_e \|x_i - x_j\| \tag{28}
\]
\[
\frac{D - 2B}{K_e K_d} \le D \le \|x_i - x_j\| \tag{29}
\]
Inequality (29) is always true, therefore the lower bound exists as stated in the theorem.

Theorem C.1 provides a complementary perspective to Theorem C.2. While the first theorem bounds
the distance in the original space based on the distance in the latent representation space, the second
theorem bounds the difference in the latent representations based on the distance in the original space.
This is useful to estimate the error from latent ambiguity in latent sort, which we will discuss in the
later sections.
Interestingly, it has a connection to ordering ambiguity. If the ordering ambiguity problem is serious
and the ambiguity sets are large, then the upper bound B for the reconstruction error is expected to be
large as well. In such cases, D − 2B in the original space can shrink to close to zero, which inversely
supports the existence of ordering ambiguity.

C.5 The Errors From Imperfect Reconstruction

Previously, when discussing the errors from ambiguity, we assumed that Y = Y ∗ in the absence of
ordering ambiguity, even though it is not realistic since the auto-encoder used in latent sort is not
bijective in most cases. To quantify the errors from imperfect reconstruction, we assume no ordering
ambiguity in this section.
We model the imperfect reconstruction by assigning a non-zero variance to the reconstructed x̂i, and
we assume the mean value equals the target xi. We assume x̂i follows a Gaussian distribution (see
Remark E.8); larger variances indicate worse reconstruction, and vice versa.
Figure 12 shows a schematic plot of an autoencoder with imperfect reconstruction. The ideal
autoencoder with perfect reconstruction is shown as dashed lines in the plot, indicating the
bijectivity of the autoencoder. However, when the reconstruction is not perfect, we assume that ĥi
is a sample from a distribution centered around hi.

Figure 12: A schematic representation of the autoencoder under imperfect reconstruction. (Diagram summary: the ideal encoder fe∗ and decoder fd∗ map xi to hi and back, with perfect reconstruction, while the learned fe and fd map x̂i to ĥi and back.)

Now, we want to understand: when observing the reconstruction error in the original N-D space, can
we estimate the distribution of the 1-D latent values? This motivates the following theorem.

Theorem C.3 (Latent Distribution Under Imperfect Reconstruction). Let σ²x̂i ∈ RN be the
component-wise variance vector of the reconstructed point x̂i, and xi = fd(hi), x̂i = fd(ĥi).
Assume that the encoder fe and decoder fd are Lipschitz continuous on X with constants Ke and Kd
respectively. If ĥi also follows a normal distribution, the following bounds hold for the mean µĥi and
variance σ²ĥi of ĥi:
\[
\frac{\|\sigma_{\hat{x}_i}\|}{K_d} \le \big|\mu_{\hat{h}_i} - h_i\big| \le K_e \|\sigma_{\hat{x}_i}\| \tag{30}
\]
\[
\frac{1}{K_d^2} \|\sigma_{\hat{x}_i}\|^2 \le \sigma_{\hat{h}_i}^2 \le 4 K_e^2 \|\sigma_{\hat{x}_i}\|^2 \tag{31}
\]

Proof. Consider the following chain of inequalities based on the Lipschitz continuity of fe and fd:
\[
\|x_i - \hat{x}_i\| = \|f_d(h_i) - f_d(\hat{h}_i)\| \tag{32}
\]
\[
\le K_d\,\big|h_i - \hat{h}_i\big|. \tag{33}
\]
Now, we can also write the Lipschitz continuity for the encoder fe:
\[
\big|h_i - \hat{h}_i\big| = |f_e(x_i) - f_e(\hat{x}_i)| \tag{34}
\]
\[
\le K_e\,\|x_i - \hat{x}_i\|. \tag{35}
\]
Combining the two inequalities, we get:
\[
\frac{\|x_i - \hat{x}_i\|}{K_d} \le \big|h_i - \hat{h}_i\big| \le K_e\,\|x_i - \hat{x}_i\| \tag{36}
\]
\[
\text{s.t. } K_e K_d \ge 1 \tag{37}
\]

Taking the squared expectation on both sides:
\[
\frac{\mathbb{E}\big[\|x_i - \hat{x}_i\|^2\big]}{K_d^2} \le \mathbb{E}\Big[\big|h_i - \hat{h}_i\big|^2\Big] \le K_e^2 \cdot \mathbb{E}\big[\|x_i - \hat{x}_i\|^2\big] \tag{38}
\]
Considering that E[x̂i] = xi, we have:
\[
\begin{aligned}
\sigma_{\hat{x}_i}^2 &= \mathrm{Var}(\hat{x}_i) & (39) \\
&= \mathbb{E}\big[(\hat{x}_i - \mathbb{E}[\hat{x}_i])^2\big] & (40) \\
&= \mathbb{E}\big[(\hat{x}_i - x_i)^2\big] & (41)
\end{aligned}
\]
And we have
\[
\begin{aligned}
\mathbb{E}\big[\|x_i - \hat{x}_i\|^2\big] &= \mathbb{E}\big[(\hat{x}_i - x_i)^\top (\hat{x}_i - x_i)\big] & (42) \\
&= \mathbb{E}\big[(\hat{x}_i - x_i)^2\big] & (43) \\
&= \sum_{j=1}^{N} \sigma_{\hat{x}_i, j}^2 = \big\|\sigma_{\hat{x}_i}\big\|^2, & (44)
\end{aligned}
\]
where σ²x̂i,j is the j-th component of the variance vector σ²x̂i. Therefore we have the upper
and lower bounds for the mean value of ĥi:
\[
\frac{\|\sigma_{\hat{x}_i}\|^2}{K_d^2} \le \mathbb{E}\Big[\big|h_i - \hat{h}_i\big|^2\Big] \le K_e^2 \cdot \|\sigma_{\hat{x}_i}\|^2 \tag{45}
\]
\[
\frac{\|\sigma_{\hat{x}_i}\|}{K_d} \le \big|\mu_{\hat{h}_i} - h_i\big| \le K_e \|\sigma_{\hat{x}_i}\| \tag{46}
\]

Now, consider the variance of ĥi. From the Lipschitz continuity of the encoder and decoder, we can
use the triangle inequality:
\[
\begin{aligned}
\sigma_{\hat{h}_i}^2 = \mathrm{Var}(\hat{h}_i) &= \mathbb{E}\big[(\hat{h}_i - \mathbb{E}[\hat{h}_i])^2\big] = \mathbb{E}\big[(\hat{h}_i - \mu_{\hat{h}_i})^2\big] & (47) \\
&= \mathbb{E}\Big[\big(\hat{h}_i - h_i + h_i - \mu_{\hat{h}_i}\big)^2\Big] & (48) \\
&\le \mathbb{E}\Big[\big(f_e(\hat{x}_i) - f_e(x_i)\big)^2\Big] + \big|\mu_{\hat{h}_i} - h_i\big|^2 & (49) \\
&\qquad + 2\,\mathbb{E}\Big[\big|f_e(\hat{x}_i) - f_e(x_i)\big| \cdot \big|\mu_{\hat{h}_i} - h_i\big|\Big] & (50) \\
&\le K_e^2\, \mathbb{E}\big[|x_i - \hat{x}_i|^2\big] + K_e^2\, \|\sigma_{\hat{x}_i}\|^2 & (51) \\
&\qquad + 2 K_e^2\, \|\sigma_{\hat{x}_i}\| \cdot \mathbb{E}\big[|x_i - \hat{x}_i|\big] & (52) \\
&\le 4 K_e^2\, \|\sigma_{\hat{x}_i}\|^2 & (53)
\end{aligned}
\]

Conversely, using the Lipschitz continuity of the decoder, we can derive the lower bound of the
variance:
h   i
σh2ˆ = Var(ĥi ) = E (ĥi − E ĥi )2 = E (ĥi − µhˆi )2
 
(54)
i
h i
2
= E (ĥi − hi ) − (µhˆi − hi ) (55)

We can apply the Cauchy-Schwarz inequality to above, which gives:


h  i2  2    2 
E ĥi − hi ĥi − µĥi ≤ E ĥi − hi E ĥi − µĥi (56)

29
We can rearrange the inequality:
h  i2
 2  E ĥi − hi ĥi − µĥi
E ĥi − µĥi ≥  2  (57)
E ĥi − hi

Using the triangle inequality, we have:
\[
\Big( \mathbb{E}\big[(\hat{h}_i - h_i)(h_i - \mu_{\hat{h}_i})\big] \Big)^2 \le \Big( \mathbb{E}\big[|\hat{h}_i - h_i|\,|h_i - \mu_{\hat{h}_i}|\big] \Big)^2 \tag{58}
\]
\[
\le \mathbb{E}\big[(\hat{h}_i - h_i)^2\big]\; \mathbb{E}\big[(h_i - \mu_{\hat{h}_i})^2\big] \tag{59}
\]
Now, we can use this bound in the expression for the lower bound:
\[
\begin{aligned}
\mathbb{E}\big[(\hat{h}_i - \mu_{\hat{h}_i})^2\big] &\ge \frac{\Big( \mathbb{E}\big[(\hat{h}_i - h_i)(\hat{h}_i - \mu_{\hat{h}_i})\big] \Big)^2}{\mathbb{E}\big[(\hat{h}_i - h_i)^2\big]} & (60) \\
&\ge \frac{\mathbb{E}\big[(\hat{h}_i - h_i)^2\big]\; \mathbb{E}\big[(h_i - \mu_{\hat{h}_i})^2\big]}{\mathbb{E}\big[(\hat{h}_i - h_i)^2\big]} & (61) \\
&= \mathbb{E}\big[(h_i - \mu_{\hat{h}_i})^2\big] & (62)
\end{aligned}
\]
Note that E[(hi − µĥi)²] = |µĥi − hi|² ≥ ∥σx̂i∥²/K²d. Thus, we have obtained the bounds for the
variance of ĥi as follows:
\[
\frac{1}{K_d^2} \cdot \|\sigma_{\hat{x}_i}\|^2 \le \sigma_{\hat{h}_i}^2 \le 4 K_e^2\, \|\sigma_{\hat{x}_i}\|^2 \tag{63}
\]

Theorem C.3 provides bounds on the mean and variance of the 1-D representation in terms of the
Lipschitz continuity of the encoder and decoder and the component-wise variance of the reconstruction
error. To build theoretical bounds for the error from imperfect reconstruction, we need to consider
how the latent distribution affects the sorting results.
How Does Imperfect Reconstruction Affect Ordering Results?
We have discussed the distribution of 1-D representation values when the reconstruction is not perfect.
Connecting back to the ordering results, when two points xi and xj follow an ordering where
i < j, we want the learned 1-D latent to always satisfy ĥi < hˆj . However, since their 1-D latent
representations have non-zero variances, it is possible that ĥi ≥ hˆj , resulting in a swap between xi
and xj and leading to an error in the sorted sequence. Therefore, we want to quantitatively measure
how much error we can expect from imperfect reconstruction.
Assuming that for all x∗i in the target sequence Y∗, we have fe(x∗i) = ĥ∗i ∼ N(µĥ∗i, σ²ĥ∗i), independent
of each other (using local DR), then for each pair of ĥ∗i and ĥ∗j where i ≠ j, the probability P(ĥ∗i > ĥ∗j)
is given by (see Remark E.2):
\[
P(\hat{h}_i^* > \hat{h}_j^*) = 1 - \Phi\Big( -\big(\mu_{\hat{h}_i^*} - \mu_{\hat{h}_j^*}\big) \Big/ \sqrt{\sigma_{\hat{h}_i^*}^2 + \sigma_{\hat{h}_j^*}^2} \Big) \tag{64}
\]
Here, Φ represents the cumulative distribution function (CDF) of the standard normal distribution.
Since the ordering in Y is given by the ordering of the latent representations, we can use Equation
(64) to calculate the matrix P as follows.

 
Denote $\gamma_{ij} = P(\hat{h}_i^* < \hat{h}_j^*) = \Phi\Big( -\big(\mu_{\hat{h}_i^*} - \mu_{\hat{h}_j^*}\big) \big/ \sqrt{\sigma_{\hat{h}_i^*}^2 + \sigma_{\hat{h}_j^*}^2} \Big)$. We can then calculate the value of
each entry in P (see Lemma D.2) under the assumption of imperfect reconstruction as:
\[
p_{ij} = \sum_{S_i, T_i} \prod_{k \in S_i} \Big[ P(\hat{h}_j^* < \hat{h}_k^*) \Big] \prod_{k \in T_i} \Big[ P(\hat{h}_j^* > \hat{h}_k^*) \Big] \tag{65}
\]
\[
= \sum_{S_i, T_i} \prod_{k \in S_i} \big( \gamma_{jk} \big) \prod_{k \in T_i} \big( 1 - \gamma_{jk} \big), \tag{66}
\]
where the summation is over all possible partitions of the index set {1, 2, ..., M} \ i into two disjoint
subsets Si and Ti with |Si| = i − 1 and |Ti| = M − i.
An Intuitive Interpretation of Equation (66): For the j-th element x∗j from the target sequence
Y∗ to be the i-th element xi in the resulting sequence Y, the encoded 1-D representation of x∗j has
to be smaller than the 1-D latents of M − i points and larger than the 1-D latents of i − 1 points.
We need to consider all partitions to sum up the probabilities over all possible orderings that place
the j-th element of Y∗ at the i-th location in Y.
Note that Equation (66) assumes no ordering ambiguity is present. If we want a more
accurate estimation of the error for the latent sort algorithm, we can incorporate ordering ambiguity into it. The
composite P matrix can be estimated by simply calculating a weighted summation over the two P
matrices from Equation (14) and Equation (66), and then normalizing the summed matrix to be a probability
matrix.
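For illustration, a minimal sketch of computing the pairwise probabilities gamma_ij from Equation (64) is given below, assuming the per-token latent means and (positive) variances are available as arrays.

import numpy as np
from scipy.stats import norm

def swap_probabilities(mu, var):
    # mu, var: (M,) arrays with the latent mean and variance of each token.
    diff = mu[:, None] - mu[None, :]               # mu_i - mu_j
    scale = np.sqrt(var[:, None] + var[None, :])   # sqrt(sigma_i^2 + sigma_j^2)
    return norm.cdf(-diff / scale)                 # gamma_ij = P(h*_i < h*_j), Eq. (64)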

C.6 Simplifying The Errors From Imperfect Reconstruction

Equation (66) gives an estimation of the errors of the latent sort algorithm using the matrix P. However,
without simplification, it is intractable to calculate the probability values due to the factorial number
of possible partitions Si and Ti. Here we will first provide a way to simplify Equation (66), then we
will use it to reveal the shortest path property of the latent sort algorithm.
The simplification leverages the sparsity in the summation in Equation (66), based on the property
that if any term in the product is zero, the product becomes zero.
Let $\Gamma_j = \prod_{k \in S_i} (\gamma_{jk}) \prod_{k \in T_i} (1 - \gamma_{jk})$. For a partition that yields Si and Ti among all possible
partitions, if ∃k ∈ Si, γjk → 0 or ∃k ∈ Ti, γjk → 1, then the value of Γj → 0.
From Theorem C.3, we can see that minimizing the reconstruction loss, empirically measured
by $\|\sigma_{\hat{x}_i}\|^2$, can lead to the collapse of the distribution of ĥ∗i to a single value hi (see Remark E.6).
Thus, when we train the auto-encoder with a reconstruction loss, we can assume that $\|\sigma_{\hat{x}_i}\|^2$ is
properly minimized (meaning it is bounded by a small constant), and the value of µĥ∗i will not
deviate from hi too much, such that we can assume µĥ∗i follows the same ascending order as hi, i.e.,
µĥ∗i ≤ µĥ∗j ⟺ i ≤ j.

In addition, from Theorems C.1 and C.2, we know that neighboring points in the original space RN
tend to have similar 1-D latent values. Moreover, from the property of Φ as the CDF of the normal
distribution, the value of Φ(x) quickly becomes infinitesimal as x deviates from zero towards
negative infinity; similarly, 1 − Φ(x) quickly becomes infinitesimal as x deviates from zero towards
positive infinity. As $\|\sigma_{\hat{x}_i}\|$ gets smaller during training, the value of $\sqrt{\sigma_{\hat{h}_i^*}^2 + \sigma_{\hat{h}_j^*}^2}$
also gets smaller, and thus the value of γij becomes more sensitive to the difference of mean values
µĥ∗i − µĥ∗j, pushing more and more values towards zero and one.

Therefore, we can replace the value of γij with 0 or 1 when i and j are not neighbors, and we can
approximate γij as follows:
\[
\gamma_{ij} \approx
\begin{cases}
1, & \text{if } i < j - 1 \\
\Phi\Big( -\big(\mu_{\hat{h}_i^*} - \mu_{\hat{h}_j^*}\big) \big/ \sqrt{\sigma_{\hat{h}_i^*}^2 + \sigma_{\hat{h}_j^*}^2} \Big), & \text{if } j - 1 \le i \le j + 1 \\
0, & \text{if } i > j + 1
\end{cases} \tag{67}
\]

Thus using Equation (67) and the definition of Si and Ti , we can simplify Equation (66) as follows:
$p_{ij} = \sum_{S_i, T_i} \prod_{k \in S_i} \gamma_{jk} \prod_{k \in T_i} \left(1 - \gamma_{jk}\right)$   (68)
$= \begin{cases} \gamma_{j,(j-1)}\left(1 - \gamma_{j,(j+1)}\right) + \gamma_{j,(j+1)}\left(1 - \gamma_{j,(j-1)}\right), & i = j \\ \gamma_{j,(j-1)}\,\gamma_{j,(j+1)}, & i = j - 1 \\ \left(1 - \gamma_{j,(j-1)}\right)\left(1 - \gamma_{j,(j+1)}\right), & i = j + 1 \\ 0, & i \notin [j - 1, j + 1] \end{cases}$   (69)
Equation (69) gives a simplification of the $P$ matrix in the case of imperfect reconstruction, where $P$ becomes a tridiagonal matrix with non-zero values on the main diagonal and on the diagonals immediately above and below it:
$P = \begin{bmatrix} p_{11} & p_{12} & 0 & \cdots & 0 \\ p_{21} & p_{22} & p_{23} & \cdots & 0 \\ 0 & p_{32} & p_{33} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & p_{MM} \end{bmatrix}$   (70)
We can do a quick check to verify that $P$ is still a probability matrix in the simplified form (see Remark E.7).
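As an illustration, the following NumPy sketch builds the simplified tridiagonal $P$ of Equation (69) from given $\gamma_{j,(j-1)}$ and $\gamma_{j,(j+1)}$ values (the naive handling of the boundary columns is an assumption of the sketch) and numerically checks that each interior column sums to one.

```python
import numpy as np

def tridiagonal_P(gamma_prev: np.ndarray, gamma_next: np.ndarray) -> np.ndarray:
    """Build the simplified tridiagonal P of Equation (69).

    gamma_prev[j] plays the role of gamma_{j,(j-1)} and gamma_next[j] of gamma_{j,(j+1)}.
    Boundary columns (j = 0 and j = M - 1) simply drop the missing entry.
    """
    M = len(gamma_prev)
    P = np.zeros((M, M))
    for j in range(M):
        x, y = gamma_prev[j], gamma_next[j]
        P[j, j] = x * (1 - y) + y * (1 - x)          # i = j
        if j - 1 >= 0:
            P[j - 1, j] = x * y                       # i = j - 1
        if j + 1 < M:
            P[j + 1, j] = (1 - x) * (1 - y)           # i = j + 1
    return P

# Quick check: interior columns sum to one for any x, y in [0, 1].
rng = np.random.default_rng(0)
gp, gn = rng.uniform(size=6), rng.uniform(size=6)
P = tridiagonal_P(gp, gn)
print(P[:, 1:-1].sum(axis=0))  # approximately all ones
```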
Now we can use Equation (69) and Equation (70) to estimate the sorting error $E$. Lemma D.1 states the equivalence in minimization between $E\big(\mathbb{E}[Y], Y^*\big)$ and $\|P - I_M\|_F^2$; thus we can calculate the difference between $P$ and $I_M$ using the simplified equations by summing the main-diagonal difference and the off-diagonal difference:
$\|P - I_M\|_F^2 = \sum_{i=1}^{M} (p_{ii} - 1)^2 + \sum_{i=1}^{M-1} \left[ p^2_{i(i+1)} + p^2_{(i+1)i} \right]$   (71)
$= \sum_{i=1}^{M} (p_{ii} - 1)^2 + \sum_{i=2}^{M} p^2_{(i-1)i} + \sum_{i=1}^{M-1} p^2_{(i+1)i}$   (72)
where the main-diagonal term is
$(p_{ii} - 1)^2 = \left( \gamma_{i,(i-1)}\left(1 - \gamma_{i,(i+1)}\right) + \gamma_{i,(i+1)}\left(1 - \gamma_{i,(i-1)}\right) - 1 \right)^2$   (73)
and the terms on the diagonals immediately above and below the main diagonal are
$p^2_{(i-1)i} = \left( \gamma_{i,(i-1)} \right)^2 \left( \gamma_{i,(i+1)} \right)^2$   (74)
$p^2_{(i+1)i} = \left( 1 - \gamma_{i,(i-1)} \right)^2 \left( 1 - \gamma_{i,(i+1)} \right)^2$   (75)
Based on the above simplification, we can now state the following theorem.

Theorem C.4. Define P using Equation (69). The minima of the error term ∥P − IM ∥2F occur at
γi,(i−1) = 0 and γi,(i+1) = 1 for all i ∈ {2, 3, ..., M − 1}.

Proof. Let x = γi,(i−1) and y = γi,(i+1) . To find the minima of the error term ∥P − IM ∥2F , we need
to differentiate the error term with respect to x and y, and find the values that minimize the error term.
Here is the error term we are working with:
$\|P - I_M\|_F^2 = \sum_{i=1}^{M} (p_{ii} - 1)^2 + \sum_{i=2}^{M} p^2_{(i-1)i} + \sum_{i=1}^{M-1} p^2_{(i+1)i}$   (76)

Using Equation (70) and excluding the cases that make $p_{ij} = 0$, we have:
$p_{ii} = x(1 - y) + y(1 - x), \quad p_{(i-1)i} = xy, \quad p_{(i+1)i} = (1 - x)(1 - y).$   (77)

Now, let’s differentiate the error term with respect to x and y. First, we will differentiate the error
term with respect to x:
$\dfrac{\partial \|P - I_M\|_F^2}{\partial x} = \sum_{i=1}^{M} 2(p_{ii} - 1) \dfrac{\partial p_{ii}}{\partial x} + \sum_{i=2}^{M} \left[ 2 p_{(i-1)i} \dfrac{\partial p_{(i-1)i}}{\partial x} \right]$   (78)
$+ \sum_{i=1}^{M-1} \left[ 2 p_{(i+1)i} \dfrac{\partial p_{(i+1)i}}{\partial x} \right]$   (79)

Now, we differentiate $p_{ii}$, $p_{(i-1)i}$, and $p_{(i+1)i}$ with respect to $x$:
$\dfrac{\partial p_{ii}}{\partial x} = (1 - y) - y = 1 - 2y, \quad \dfrac{\partial p_{(i-1)i}}{\partial x} = y, \quad \dfrac{\partial p_{(i+1)i}}{\partial x} = y - 1.$   (80)
Plugging these derivatives back into the expression for the partial derivative with respect to x, we get:
$\dfrac{\partial \|P - I_M\|_F^2}{\partial x} = \sum_{i=1}^{M} \left[ 2(p_{ii} - 1)(1 - 2y) \right] + \sum_{i=2}^{M} \left[ 2 p_{(i-1)i} \, y \right]$   (81)
$+ \sum_{i=1}^{M-1} \left[ 2 p_{(i+1)i} (y - 1) \right]$   (82)
$= \sum_{i=1}^{M} \left[ 2(x + y - 2xy - 1)(1 - 2y) \right]$   (83)
$+ \sum_{i=2}^{M} \left[ 2xy^2 \right] + \sum_{i=1}^{M-1} \left[ 2(1 - x - y + xy)(y - 1) \right]$   (84)

Since $x$ and $y$ lie in $[0, 1]$, for the above partial derivative to be zero, we need either $x = 0, y = 1$ or $x = 1, y = 0$.
Next, we will differentiate the error term with respect to $y$. Noting the symmetry in Equation (77), we similarly have:
$\dfrac{\partial \|P - I_M\|_F^2}{\partial y} = \sum_{i=1}^{M} \left[ 2(x + y - 2xy - 1)(1 - 2x) \right]$   (85)
$+ \sum_{i=2}^{M} \left[ 2yx^2 \right] + \sum_{i=1}^{M-1} \left[ 2(1 - x - y + xy)(x - 1) \right]$   (86)

Again, to make the above derivative zero, $x = 0, y = 1$ and $x = 1, y = 0$ are the two possible solutions. Thus, by setting the two partial derivatives to zero, we can see that the two minima of $\|P - I_M\|_F^2$ occur at $x = 0, y = 1$ and $x = 1, y = 0$.
However, recall the definitions $x = \gamma_{i,(i-1)}$ and $y = \gamma_{i,(i+1)}$, and the definition $\gamma_{ij} = P(\hat{h}^*_i < \hat{h}^*_j)$. Since Equations (67), (69), and (70) are defined over every row of $P$, which corresponds to every position in the sorted sequence, we can see that the two minima represent two different orderings of the sorted sequence, where $x = 0, y = 1$ gives ascending order and $x = 1, y = 0$ gives descending order. Since in our discussion we only consider ascending order in the target sequence, $x = 0, y = 1$ is the only minimum that satisfies our definitions.
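As a quick numerical sanity check of Theorem C.4, one can evaluate the per-index contribution of Equations (73)-(75) on a grid over $(x, y) \in [0, 1]^2$; the grid resolution in the sketch below is arbitrary.

```python
import numpy as np

def per_index_error(x: float, y: float) -> float:
    """Per-index contribution to ||P - I_M||_F^2 from Equations (73)-(75),
    with x = gamma_{i,(i-1)} and y = gamma_{i,(i+1)}."""
    main_diag = (x * (1 - y) + y * (1 - x) - 1) ** 2   # Equation (73)
    upper = (x * y) ** 2                                # Equation (74)
    lower = ((1 - x) * (1 - y)) ** 2                    # Equation (75)
    return main_diag + upper + lower

xs = ys = np.linspace(0.0, 1.0, 101)
E = np.array([[per_index_error(x, y) for y in ys] for x in xs])
ix, iy = np.unravel_index(E.argmin(), E.shape)
print(xs[ix], ys[iy], E.min())  # minimum at (x, y) = (0, 1) (or (1, 0)), with error 0
```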

Theorem C.4 implies that in order to minimize $E$, we can instead minimize $\|P - I_M\|_F^2$ (see Lemma D.1) by minimizing the equivalent objective $\gamma_{i,(i-1)} - \gamma_{i,(i+1)}$ for all $i = 2, \dots, M - 1$, where both $\gamma_{i,(i-1)}$ and $\gamma_{i,(i+1)}$ are between 0 and 1. Using the definition of $\gamma_{ij}$, we can rewrite this alternative minimization problem, for all $i \in \{2, 3, \dots, M - 1\}$, as:
$\min_f \; \gamma_{i,(i-1)} - \gamma_{i,(i+1)}$   (87)
$= \min_f \; \Phi\!\left( -\dfrac{\mu_{\hat{h}^*_i} - \mu_{\hat{h}^*_{i-1}}}{\sqrt{\sigma^2_{\hat{h}^*_i} + \sigma^2_{\hat{h}^*_{i-1}}}} \right) - \Phi\!\left( -\dfrac{\mu_{\hat{h}^*_i} - \mu_{\hat{h}^*_{i+1}}}{\sqrt{\sigma^2_{\hat{h}^*_i} + \sigma^2_{\hat{h}^*_{i+1}}}} \right)$   (88)
$= \min_f \; \Phi\!\left( \dfrac{\mu_{\hat{h}^*_i} - \mu_{\hat{h}^*_{i+1}}}{\sqrt{\sigma^2_{\hat{h}^*_i} + \sigma^2_{\hat{h}^*_{i+1}}}} \right) - \Phi\!\left( \dfrac{\mu_{\hat{h}^*_i} - \mu_{\hat{h}^*_{i-1}}}{\sqrt{\sigma^2_{\hat{h}^*_i} + \sigma^2_{\hat{h}^*_{i-1}}}} \right)$   (89)

Assume the 1-D latents are normalized to $[0, 1]$. From the property of the CDF $\Phi$, we know that (89) reaches its minimum of $-1$ only when the reconstruction is perfect, such that the variances $\sigma_{\hat{h}^*_{i-1}}, \sigma_{\hat{h}^*_i}, \sigma_{\hat{h}^*_{i+1}}$ are zero.

C.7 Optimal Sorting From The Perspective of Shortest Path Problem

In the preceding analysis, we discussed the error of $Y$ relative to the target sorting $Y^*$ using the probability matrix $P$. However, the target sorting has not yet been defined. The optimal sorting needs to have properties that minimize the errors induced by imperfect $P$ matrices. Let us first discuss the desired properties of $Y^*$.
Recall that $\mathbb{E}[Y] = P Y^*$. From Equations (1) and (70), we have
$E\big(\mathbb{E}[Y], Y^*\big)$   (90)
$= \|P Y^* - Y^*\|_F^2$   (91)
$= \sum_{i=1}^{M} \Big\| \sum_{j=1}^{M} p_{ij} x^*_j - x^*_i \Big\|_F^2$   (92)
$= \sum_{i=1}^{M} \big\| p_{i(i-1)} x^*_{i-1} + (p_{ii} - 1) x^*_i + p_{i(i+1)} x^*_{i+1} \big\|_F^2$   (93)
$= \sum_{i=1}^{M} \big\| p_{i(i-1)} x^*_{i-1} + \big( -p_{i(i-1)} - p_{i(i+1)} \big) x^*_i + p_{i(i+1)} x^*_{i+1} \big\|_F^2$   (94)
$= \sum_{i=1}^{M} \big\| p_{i(i-1)} (x^*_{i-1} - x^*_i) + p_{i(i+1)} (x^*_{i+1} - x^*_i) \big\|_F^2$   (95)

Recall that $f$ is a local DR that maps each point in $\mathcal{X}$ to its 1-D latent representation independently of the other elements in the sequence. That being said, we expect (95) to be minimized for any three points $\{x^*_{i-1}, x^*_i, x^*_{i+1}\} \subset \mathcal{X}$. To minimize $E\big(\mathbb{E}[Y], Y^*\big)$ for all possible $\mathcal{X} \subseteq \mathbb{R}^N$, we need to minimize the upper bound of (95), given by:
$\sum_{i=1}^{M} \big\| p_{i(i-1)} (x^*_{i-1} - x^*_i) + p_{i(i+1)} (x^*_{i+1} - x^*_i) \big\|_F^2$   (96)
$\le \sum_{i=1}^{M} \big\| p_{i(i-1)} (x^*_{i-1} - x^*_i) \big\|_F^2 + \big\| p_{i(i+1)} (x^*_{i+1} - x^*_i) \big\|_F^2$   (97)

Assume the matrix $P$ is already given for a sorting algorithm using a DR mapping $f$, and that both $p_{i(i-1)}$ and $p_{i(i+1)}$ are non-zero for each $i$. From (97), we can see that the ideal sorting $Y^*$ needs to have minimal $(x^*_{i-1} - x^*_i)$ and $(x^*_{i+1} - x^*_i)$. An intuitive interpretation is that the ideal sorting $Y^*$ needs to minimize the distance between each pair of neighboring points, so that the errors from an imperfect $P$ are minimized.

This property has an interesting connection to the shortest path problem and the traveling salesman problem (TSP), since a sorting can be seen as a path that traverses all the points in a given set. The desired target sorting is given by the solution of the TSP on the input point set $\mathcal{X}$. However, the TSP is NP-hard and requires expensive algorithms to solve. Furthermore, autoregressive models face significant difficulties in learning the ordering rule represented by the TSP solution.
Why Do Autoregressive Models Struggle to Find the Shortest Path?
The TSP is inherently a problem that requires a global understanding of the entire point set to solve.
An optimal solution necessitates considering all the points simultaneously, understanding the distance
and relationship between each pair of points, and calculating a path that minimally covers all points.
On the contrary, autoregressive models are inherently local and sequential in their processing. They
generate output one token at a time and in a specific order, each time only attending to previous
tokens in the sequence. They do not consider future tokens or have an overview of the entire set of
tokens at each generation step.
Thus, the local, one-sided nature of autoregressive models is at odds with the global problem-solving
requirement of the TSP, leading to significant challenges when attempting to learn the ordering rule
represented by the solution of TSP.
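Although solving the TSP exactly is infeasible for large point sets, the "minimal neighboring distance" property of the target ordering can be illustrated with a cheap greedy nearest-neighbor heuristic; the NumPy sketch below is only an illustrative approximation and is not the latent sort algorithm itself.

```python
import numpy as np

def greedy_nearest_neighbor_order(points: np.ndarray) -> np.ndarray:
    """Return an index ordering that greedily hops to the closest unvisited point.

    points -- (M, N) array of graph tokens / points in R^N.
    This is a cheap TSP heuristic used here only to illustrate the
    'minimal distance between neighboring pairs' property of the target sorting.
    """
    M = len(points)
    visited = np.zeros(M, dtype=bool)
    order = [0]                       # arbitrary starting point
    visited[0] = True
    for _ in range(M - 1):
        last = points[order[-1]]
        dists = np.linalg.norm(points - last, axis=1)
        dists[visited] = np.inf       # never revisit a point
        nxt = int(dists.argmin())
        order.append(nxt)
        visited[nxt] = True
    return np.array(order)

points = np.random.default_rng(0).normal(size=(8, 2))
print(greedy_nearest_neighbor_order(points))
```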

D Supplementary Lemmas
In this section, we provide some lemmas that help to understand the theoretical results discussed in
Section C.
Lemma D.1. Let $Y^*$ be an $M \times N$ matrix with full column rank, and let $P$ be an $M \times M$ probability matrix, i.e., the sum of every row and of every column of $P$ is one. By minimizing $\|P - I_M\|_F^2$, we can solve the minimization problem $\min_P \|P Y^* - Y^*\|_F^2$.

Proof. We want to show that by minimizing $\|P - I_M\|_F^2$, we can minimize $\|P Y^* - Y^*\|_F^2$. We know that $P$ is a probability matrix, so the set of all possible $P$ is a compact subset of a bounded space, i.e., $P \in [0, 1]^{M \times M}$, and the objective function is continuous on that space. In this case, the Weierstrass theorem ensures the existence of a minimum.
First, consider the objective function we want to minimize:
$\min_P \|P Y^* - Y^*\|_F^2$   (98)

Now, let’s expand the Frobenius norm:


$\|P Y^* - Y^*\|_F^2 = \mathrm{tr}\big((P Y^* - Y^*)^T (P Y^* - Y^*)\big)$   (99)

Expanding this expression, using the property of the trace $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, and regrouping the terms, we have:
$\mathrm{tr}\big((P Y^* - Y^*)^T (P Y^* - Y^*)\big) = \mathrm{tr}\big((P - I_M) Y^{*T} Y^* P^T + (I_M - P) Y^* Y^{*T}\big)$   (100)

Notice that the expression above is minimized when $(P - I_M) Y^{*T} Y^* P^T = 0$ and $(I_M - P) Y^* Y^{*T} = 0$. Both of these conditions are satisfied when $P = I_M$. This is because, when $P = I_M$, we have:

$(I_M - I_M) Y^{*T} Y^* I_M^T = 0$   (101)

and

$(I_M - I_M) Y^* Y^{*T} = 0$   (102)

Thus, by minimizing $\|P - I_M\|_F^2$, we can solve the minimization problem $\min_P \|P Y^* - Y^*\|_F^2$.

Lemma D.2. Given a set of M random variables x1 , x2 , ..., xM ∈ X that follow normal distributions
with potentially different means and variances, let y1 ≤ y2 ≤ · · · ≤ yM represent the sorted sequence
of these random variables. Define cij = P (xi > xj ) for all i, j ∈ {1, 2, ..., M } with i ̸= j. Then,
the probability distribution of the k-th element, yk , in the sorted sequence can be expressed as:
$P(y_k = x_i) = \sum_{S, T} \prod_{j \in S} (1 - c_{ij}) \prod_{j \in T} c_{ij}$   (103)

where the summation is over all possible partitions of the set $\{1, 2, \dots, M\} \setminus \{i\}$ into two disjoint subsets $S$ and $T$ with $|S| = k - 1$ and $|T| = M - k$.

Proof. Let’s denote the sorted sequence of the independent random variables as [y1 , y2 , ..., yM ],
where y1 ≤ y2 ≤ · · · ≤ yM . We want to find the probability distribution of the k-th element, yk , in
this sorted sequence.
First, note that for any two random variables xi and xj , the probability that xi > xj is given by cij .
The complementary probability, that xi ≤ xj , is given by 1 − cij .
Let’s consider the probability that a specific random variable, xi , is the k-th element in the sorted
sequence. For this to happen, there must be exactly k − 1 random variables that are less than or equal
to xi and M − k random variables that are greater than xi .
Since the random variables are independent from each other, the probability of k − 1 random variables
being less than or equal to xi can be calculated as the product of probabilities 1 − cij for all j ̸= i
and j ∈ S, where S is a subset of {1, 2, ..., M } \ {i} with |S| = k − 1, where \ represents relative
complement or set difference. The probability of M − k random variables being greater than xi
can be calculated as the product of probabilities cij for all j ̸= i and j ∈ T , where T is a subset of
{1, 2, ..., M } \ {i} with |T | = M − k.
Therefore, the probability that xi is the k-th element in the sorted sequence is:
$P(y_k = x_i) = \sum_{S, T} \prod_{j \in S} (1 - c_{ij}) \prod_{j \in T} c_{ij}$   (104)

where the summation is over all possible partitions of the set {1, 2, ..., M } \ {i} into two disjoint
subsets S and T with |S| = k − 1 and |T | = M − k.
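Lemma D.2 can also be probed empirically for small $M$: sample the Gaussians many times, sort each draw, and count how often $x_i$ lands at position $k$; the resulting frequencies can then be compared against the partition sum of Equation (103). The simulation sketch below uses arbitrary example means and variances.

```python
import numpy as np

# Arbitrary example: M = 4 Gaussians with different means and variances.
rng = np.random.default_rng(0)
mu = np.array([0.0, 0.5, 1.0, 1.5])
sigma = np.array([0.3, 0.4, 0.2, 0.5])
M, draws = len(mu), 200_000

samples = rng.normal(mu, sigma, size=(draws, M))   # each row is one draw of (x_1, ..., x_M)
ranks = samples.argsort(axis=1)                    # ranks[d, k] = i such that y_k = x_i in draw d

# freq[k, i] estimates P(y_k = x_i); it can be compared against the partition sum (103).
freq = np.zeros((M, M))
for k in range(M):
    idx, counts = np.unique(ranks[:, k], return_counts=True)
    freq[k, idx] = counts / draws
print(np.round(freq, 3))
```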

E Supplementary Remarks and Comments


In this section, we provide some remarks and comments that are useful for readers to understand
some claims that are used in Section C.
Remark E.1. Sorting on 1-D space R is well-defined.
Explanation: A total order $\le$ on a set $S$ is a binary relation that satisfies the following properties for all $a, b, c \in S$:

• Reflexivity: a ≤ a.
• Antisymmetry: If a ≤ b and b ≤ a, then a = b.
• Transitivity: If a ≤ b and b ≤ c, then a ≤ c.
• Totality (or connexity): Either a ≤ b or b ≤ a.

In the 1-D space, i.e., on the real numbers R, the usual order ≤ is a total order that satisfies these
properties. Therefore, for any two real numbers a and b, we can unambiguously compare them using
the order relation ≤. This implies that sorting on 1-D space is well-defined, as the total order allows
us to determine the correct order of any pair of real numbers.
Furthermore, the well-ordering property of the integers Z and the dense nature of the rational numbers
Q within the real numbers R provide a solid foundation for sorting algorithms on 1-D space, as these
properties allow us to efficiently find the desired order of elements and guarantee the existence of a
unique sorted sequence.

Remark E.2. For two independent random variables $a \sim \mathcal{N}(\mu_a, \sigma_a^2)$ and $b \sim \mathcal{N}(\mu_b, \sigma_b^2)$, the probability $P(a > b)$ monotonically increases as $\mu_a - \mu_b$ increases, and, when $\mu_a > \mu_b$, $P(a > b)$ monotonically decreases as $\sigma_a^2 + \sigma_b^2$ increases.
Explanation: When two random variables a and b are from normal distributions and are independent
from each other, the difference between the two variables can be described by another normal
distribution. Let c = a − b. Since a and b are independent, the mean and variance of c can be
computed as follows:
Mean of $c$: $\mu_c = \mu_a - \mu_b$   (105)
Variance of $c$: $\sigma_c^2 = \sigma_a^2 + \sigma_b^2$   (106)

Thus, c follows a normal distribution with mean µc and variance σc2 .


Now, to find the probability that a > b is equivalent to finding the probability that c > 0. We can
compute this using the CDF of the normal distribution.
Let Z be the standard normal variable, such that Z = (c − µc )/σc . The probability we want to find
is P (c > 0), which is equivalent to finding P (Z > −µc /σc ), since Z is standardized.
Using the standard normal CDF (Φ), we can find the probability as:
$P(a > b) = P(Z > -\mu_c / \sigma_c)$   (107)
$= 1 - \Phi(-\mu_c / \sigma_c)$   (108)
$= 1 - \Phi\!\left( -(\mu_a - \mu_b) \big/ \sqrt{\sigma_a^2 + \sigma_b^2} \right)$   (109)

Here, Φ represents the cumulative distribution function (CDF) of the standard normal distribution.
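For concreteness, Equation (109) can be evaluated directly with SciPy; the parameter values in the sketch below are arbitrary examples.

```python
import numpy as np
from scipy.stats import norm

def prob_a_greater_b(mu_a, sigma_a, mu_b, sigma_b):
    """P(a > b) for independent a ~ N(mu_a, sigma_a^2) and b ~ N(mu_b, sigma_b^2),
    following Equation (109)."""
    return 1.0 - norm.cdf(-(mu_a - mu_b) / np.sqrt(sigma_a**2 + sigma_b**2))

print(prob_a_greater_b(1.0, 0.5, 0.0, 0.5))   # closed form, approximately 0.921

# Monte Carlo cross-check with the same (arbitrary) parameters.
rng = np.random.default_rng(0)
a = rng.normal(1.0, 0.5, 100_000)
b = rng.normal(0.0, 0.5, 100_000)
print((a > b).mean())                         # empirical estimate, close to the value above
```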

Remark E.3. For two sets $\mathcal{X} \subset \mathbb{R}^N$ and $\mathcal{H} \subset \mathbb{R}^M$ where $N > M$, a bijective mapping between them may exist, but not always.
Explanation: We can think from the perspective of the cardinality, Lebesgue measure, and intrinsic
dimension of the two sets X and H.
• Finite sets: If both X and H are finite, there exists a bijective mapping between them if and
only if they have the same cardinality (number of elements). In formal terms, a bijection
f : X → H exists if |X | = |H|.
• Countably infinite sets: If one or both of the sets are countably infinite, there exists a bijective
mapping between them if both sets have the same cardinality, which is the cardinality (aleph
number) of the set of natural numbers, ℵ0 . In this case, a bijection f : X → H exists if
|X | = |H| = ℵ0 .
• Uncountably infinite sets: If one or both of the sets are uncountably infinite, we need to
consider the cardinality, Lebesgue measure, and intrinsic dimension of the sets.
- Cardinality: A bijection between the sets can exist if both sets have the same cardinality.
For example, if both sets have the cardinality of the continuum (ℵ1 or 2ℵ0 ), there exists a
bijection f : X → H.
- Lebesgue measure: When comparing sets with different Lebesgue measures, it is more
challenging to find a bijection that preserves the local properties of the spaces. For instance,
a bijection between a 3D cube with finite volume and a 2D plane with infinite area might
not be meaningful in terms of preserving the local properties of the spaces.
- Intrinsic dimension: If $\mathcal{X}$ is an $N$-D manifold and $\mathcal{H}$ is an $M$-D manifold, with $M < N$, a bijective mapping between them can exist if the intrinsic dimension of the spaces is the same. Specifically, if the intrinsic dimension of $\mathcal{X}$ is equal to the intrinsic dimension of $\mathcal{H}$, it is possible to find a continuous, bijective function $f : \mathcal{X} \to \mathcal{H}$ that preserves the local properties of the spaces (e.g., constructions in the spirit of lexicographical sorting or 1-D space-filling curves).
In summary, we can see that under certain conditions, there exists a bijective mapping between X
and H.

Remark E.4. A deterministic neural network can be represented by a surjective mapping.
Explanation: A deterministic neural network on real-space can be expressed by f : X → Y that
maps an input space X to an output space Y using a fixed set of weights and biases, and no stochastic
components are involved.
Assume $\mathcal{Y}$ is the complete image of the input set $\mathcal{X}$ under $f$. For every element $y \in \mathcal{Y}$, there exists at least one element $x \in \mathcal{X}$ such that $f(x) = y$. On the other hand, for each $x \in \mathcal{X}$, there cannot be more than one $y \in \mathcal{Y}$ such that $y = f(x)$, unless there is stochasticity in the model.

Remark E.5. If an autoencoder achieves perfect reconstruction, both its encoder and decoder are bijective.
Explanation: A bijective mapping is both injective (one-to-one) and surjective (onto).
Let E : X → Z be the encoder function and D : Z → X be the decoder function. The autoencoder
achieves perfect reconstruction if for any input x ∈ X , we have D(E(x)) = x.
• Injectivity:
(a) Encoder: We want to show that if x1 , x2 ∈ X and x1 ̸= x2 , then E(x1 ) ̸= E(x2 ).
Suppose, for the sake of contradiction, that E(x1 ) = E(x2 ). Then, we have:
D(E(x1 )) = D(E(x2 ))
x1 = x2
This contradicts the assumption that x1 ̸= x2 . Therefore, E(x1 ) ̸= E(x2 ), and the encoder
is injective.
(b) Decoder: We want to show that if z1 , z2 ∈ Z and z1 ̸= z2 , then D(z1 ) ̸= D(z2 ).
Since the encoder is injective, for z1 ̸= z2 , there exist distinct inputs x1 , x2 ∈ X such that
E(x1 ) = z1 and E(x2 ) = z2 . Then, we have:
D(z1 ) = D(E(x1 ))
D(z2 ) = D(E(x2 ))
x1 ̸= x2
Hence, D(z1 ) ̸= D(z2 ), and the decoder is injective.
• Surjectivity:
(a) Encoder: We want to show that for every point z ∈ Z, there exists an input x ∈ X
such that E(x) = z. Since the autoencoder achieves perfect reconstruction, for every input
x ∈ X , we have D(E(x)) = x. Let z = E(x) for some x ∈ X . Then, the encoder covers
the entire latent space, and it is surjective.
(b) Decoder: We want to show that for every point x ∈ X , there exists a point z ∈ Z
such that D(z) = x. Since the autoencoder achieves perfect reconstruction, for every input
x ∈ X , we have D(E(x)) = x. Let z = E(x) for some x ∈ X . Then, the decoder covers
the entire input space, and it is surjective.
In conclusion, if an autoencoder achieves perfect reconstruction, both the encoder and decoder
functions are bijective mappings. The perfect reconstruction property ensures that both functions are
injective and surjective.

Remark E.6. In Theorem C.3, when $\|\sigma^2_{\hat{x}_i}\| \to 0$, we have $\hat{h}_i \to h_i$.

Explanation: As $\|\sigma^2_{\hat{x}_i}\| \to 0$, we have $\sigma_{\hat{x}_i} \to 0$. Let $\sigma = \sigma_{\hat{x}_i}$; we can write:
$\lim_{\sigma \to 0} \dfrac{\sigma}{K_d} = 0$   (110)
$\lim_{\sigma \to 0} K_e \sigma = 0$   (111)
$\lim_{\sigma \to 0} \dfrac{1}{K_d^2} \sigma^2 = 0$   (112)
$\lim_{\sigma \to 0} 4 K_e^2 \sigma^2 = 0$   (113)

Now, using the sandwich theorem (squeeze theorem), we can show that:
$0 \le \lim_{\sigma \to 0} \big( \mu_{\hat{h}_i} - h_i \big) \le 0$   (114)
$0 \le \lim_{\sigma \to 0} \sigma^2_{\hat{h}_i} \le 0$   (115)
Since the only value satisfying both inequalities is 0, we can conclude that when $\sigma \to 0$, we have
$\lim_{\sigma \to 0} \mu_{\hat{h}_i} = h_i$   (116)
$\lim_{\sigma \to 0} \sigma^2_{\hat{h}_i} = 0$   (117)

Thus, the distribution of $\hat{h}_i$ collapses to the single value $h_i$.

Remark E.7. The simplified matrix given by Equation (69) and Equation (70) is a probability matrix.

Explanation: To prove that P is a probability matrix, we need to show that:

1. Each entry pij is between 0 and 1.


2. The sum of the elements in each row and each column is equal to 1.

Denote $x = \gamma_{j,(j-1)}$ and $y = \gamma_{j,(j+1)}$, and recall that $x$ and $y$ are both between 0 and 1. It is clear that each entry will be between 0 and 1, satisfying the first condition.
Now, let's consider the sum of the elements in each column. We will sum over $i$ for a fixed value of $j$:
$\sum_{i=1}^{M} p_{ij} = p_{j-1,j} + p_{j,j} + p_{j+1,j}$   (118)
$= xy + x(1 - y) + y(1 - x) + (1 - x)(1 - y)$   (119)
$= x(1 - y) + y(1 - x) + xy + 1 - x - y + xy$   (120)
$= 1 - x - y + 2xy + x + y - 2xy$   (121)
$= 1$   (122)

Now, let's check the sum of the elements in each row. We will sum over $j$ for a fixed value of $i$:
$\sum_{j=1}^{M} p_{ij} = p_{i,i-1} + p_{i,i} + p_{i,i+1}$   (123)

Note that the expression for $p_{ij}$ is symmetric with respect to $i$ and $j$. Thus, the sum over each row is the same as the sum over each column, which we have already shown to be 1.
Since we have proven that each entry $p_{ij}$ is between 0 and 1, and that the sum of the elements in each row and each column is 1, we can conclude that $P$ is a probability matrix.
Remark E.8. We assume the reconstructed $\hat{x}_i$ follows a Gaussian distribution, where larger variances indicate worse reconstruction.

Explanation: This assumption is plausible when employing the Mean Squared Error (MSE) as a
measure of reconstruction loss. This Gaussian assumption aligns with the nature of MSE. The
MSE is an estimator that measures the average squared differences between estimated and true
values, essentially quantifying variance around the mean. Assuming a Gaussian distribution of
the reconstructed x̂i aligns with the statistical properties of MSE, as the Gaussian distribution is
parametrized by the mean and variance, mirroring the way MSE operates.
Hence, larger variances in the Gaussian distribution of x̂i would indicate a worse reconstruction
due to higher dispersion from the mean (true) value, and conversely, smaller variances suggest a
better reconstruction. This perspective offers a statistical rationale for evaluating the quality of the
reconstruction process.
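To illustrate this link numerically: for a fixed variance, the Gaussian negative log-likelihood of a reconstruction is an affine function of the squared error, so minimizing MSE is equivalent to maximizing the Gaussian likelihood. The sketch below uses arbitrary example values.

```python
import numpy as np

x_true = np.array([0.2, -1.0, 0.7])   # "true" values (arbitrary example)
x_hat = np.array([0.1, -0.8, 0.9])    # reconstructed values (arbitrary example)
sigma = 0.5                           # assumed fixed reconstruction standard deviation

mse = np.mean((x_hat - x_true) ** 2)

# Gaussian negative log-likelihood with the fixed sigma above.
nll = np.mean(0.5 * ((x_hat - x_true) / sigma) ** 2 + 0.5 * np.log(2 * np.pi * sigma**2))

# nll = mse / (2 * sigma^2) + constant, so both are minimized by the same x_hat.
print(mse, nll, mse / (2 * sigma**2) + 0.5 * np.log(2 * np.pi * sigma**2))
```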

