0% found this document useful (0 votes)
5 views

(BDI) Bidirectional Learning For Offline Infinite-Width Model-Based Optimization

Uploaded by

Cường Đào
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

(BDI) Bidirectional Learning For Offline Infinite-Width Model-Based Optimization

Uploaded by

Cường Đào
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

Bidirectional Learning for Offline Infinite-width

Model-based Optimization

Can (Sam) Chen1∗, Yingxue Zhang2 , Jie Fu3 , Xue Liu1 , Mark Coates1
1
McGill University, 2 Huawei Noah’s Ark Lab, 3 Beijing Academy of Artificial Intelligence
[email protected], [email protected],
[email protected], [email protected], [email protected]

Abstract

In offline model-based optimization, we strive to maximize a black-box objective


function by only leveraging a static dataset of designs and their scores. This prob-
lem setting arises in numerous fields including the design of materials, robots, DNA
sequences, and proteins. Recent approaches train a deep neural network (DNN) on
the static dataset to act as a proxy function, and then perform gradient ascent on
the existing designs to obtain potentially high-scoring designs. This methodology
frequently suffers from the out-of-distribution problem where the proxy function
often returns poor designs. To mitigate this problem, we propose BiDirectional
learning for offline Infinite-width model-based optimization (BDI). BDI consists
of two mappings: the forward mapping leverages the static dataset to predict the
scores of the high-scoring designs, and the backward mapping leverages the high-
scoring designs to predict the scores of the static dataset. The backward mapping,
neglected in previous work, can distill more information from the static dataset
into the high-scoring designs, which effectively mitigates the out-of-distribution
problem. For a finite-width DNN model, the loss function of the backward mapping
is intractable and only has an approximate form, which leads to a significant dete-
rioration of the design quality. We thus adopt an infinite-width DNN model, and
propose to employ the corresponding neural tangent kernel to yield a closed-form
loss for more accurate design updates. Experiments on various tasks verify the
effectiveness of BDI. The code is available here.

1 Introduction
Designing a new object or entity with desired properties is a fundamental problem in science and
engineering [1]. From material design [2] to protein design [3, 4], many works rely on interactive
access with the unknown objective function to propose new designs. However, in real-world scenarios,
evaluation of the objective function is often expensive or dangerous [2–6], and thus it is more
reasonable to assume that we only have access to a static (offline) dataset of designs and their scores.
This setting is called offline model-based optimization. We aim to find a design to maximize the
unknown objective function by only leveraging the static dataset.
A common approach to solve this problem is to train a deep neural network (DNN) model on the
static dataset to act as a proxy function parameterized as fθ (·). Then the new designs are obtained by
performing gradient ascent on the existing designs with respect to fθ (·). This approach is attractive
because it can leverage the gradient information of the DNN model to obtain improved designs.
Unfortunately, the trained DNN model is only valid near the training distribution and performs poorly

Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).


outside of the distribution. More specifically, the designs generated by directly optimizing fθ (·) are
scored with erroneously high values [7, 8]. As shown in Figure 1, the proxy function trained on the
static dataset p1,2,3 overestimates the ground truth objective function, and the seemingly high-scoring

design pgrad obtained by gradient ascent has a low ground truth score pgrad .
Recent works [7–9] address the out-of-distribution problem by imposing effective inductive biases on
the DNN model. The method in [8] learns a proxy function fθ (·) that lower bounds the ground truth
scores on out-of-distribution designs. The approach in [9] bounds the distance between the proxy
function and the objective function by normalized maximum likelihood. In [7], Yu et al. use a local
smoothness prior to overcome the brittleness of the proxy function and thus avoid poor designs.
As demonstrated in Figure 1, these previous
works try to better fit the proxy function to the
ground truth function from a model perspective,
and then obtain high-scoring designs by gradient
ascent regarding the proxy function. In this pa-
per, we investigate a method that does not focus
on regularizing the model, but instead ensures
that the proposed designs can be used to predict
the available data. As illustrated in Figure 1, by
ensuring the high-scoring design pours can pre-
dict the scores of the static dataset p1,2,3 (back-
ward) and vice versa (forward), pours distills
more information from p1,2,3 , which makes
pours more aligned with p1,2,3 , leading to p′ours
with a high ground truth score.
We propose BiDirectional learning for offline Figure 1: Illustration of motivation.
Infinite-width model-based optimization (BDI)
between the high-scoring designs and the static dataset (a.k.a. low-scoring designs). BDI has a
forward mapping from low-scoring to high-scoring and a backward mapping from high-scoring to
low-scoring. The backward mapping means the DNN model trained on the high-scoring designs
is expected to predict the scores of the static dataset, and vice versa for the forward mapping. To
compute the prediction losses, we need to know the scores of the high-scoring designs. Inspired
by [10], where a predefined reward (score) is used for reinforcement learning, we set a predefined
target score for the high-scoring designs. By minimizing the prediction losses, the bidirectional
mapping distills the knowledge of the static dataset into the high-scoring designs. This ensures that
the high-scoring designs are more aligned with the static dataset and thus effectively mitigates the out-
of-distribution problem. For a finite-width DNN model, the loss function of the backward mapping
is intractable and can only yield an approximate form, which leads to a significant deterioration
of the design quality. We thus adopt a DNN model with infinite width, and propose to adopt the
corresponding neural tangent kernel (NTK) [11, 12] to produce a closed-form loss function for more
accurate design updates. This is discussed in Section 4.5. Experiments on multiple tasks in the
design-bench [1] verify the effectiveness of BDI.
To summarize, our contributions are three-fold:

• We propose bidirectional learning between the high-scoring designs and the static dataset, which
effectively mitigates the out-of-distribution problem.
• To enable a closed-form loss function, we adopt an infinite-width DNN model and introduce NTK
into the bidirectional learning, which leads to better designs.
• We achieve state-of-the-art results on various tasks, which demonstrates the effectiveness of BDI.

2 Preliminaries
2.1 Offline Model-based Optimization

Offline model-based optimization is formally defined as:


x∗ = arg max f (x) , (1)
x

2
where the objective function f (x) is unknown, but we have access to a size N dataset D =
{(x1 , y1 )}, · · · , {(xN , yN )}, where xi represents a certain design and yi denotes the design score.
The design could, for example, be a drug, an aircraft or a robot morphology. Denote the feature
dimension of a design x as D and then the static dataset D can also have a vector representation
(Xl , yl ) where Xl ∈ RN×D , yl ∈ RN .
A common approach is fitting a DNN model fθ (·) with parameters θ to the static dataset:
N
1 X
θ ∗ = arg min (fθ (xi ) − yi )2 . (2)
θ N i=1

After that, the high-scoring design x∗ can be obtained by optimizing x against the proxy function
fθ∗ (x) by gradient ascent steps:
xt+1 = xt + η∇x fθ (x)|x=xt , for t ∈ [0, T − 1] , (3)
where T is the number of steps. The high-scoring design x∗ can be obtained as xT . This simple
gradient ascent method suffers from the out-of-distribution problem — the proxy function fθ (x) is
not accurate for unseen designs, and its actual score is usually considerably lower than predicted [8].

2.2 Infinitely Wide DNN and Neural Tangent Kernel

Deep infinitely wide neural network models have recently received substantial attention. Previous
works establish the correspondence between infinitely wide neural networks and kernel methods [11–
14]. In the infinite width assumption, wide neural networks trained by gradient descent with a squared
loss yield a neural tangent kernel (NTK), which is specified by the network architecture. More
specifically, given two samples (designs) xi and xj , the NTK k(xi , xj ) measures the similarity
between the two samples from the perspective of the DNN model.
The NTKs have outperformed their finite network counterparts in various tasks and achieved state-of-
the-art performance for classification tasks such as CIFAR10 [15]. The approach in [16] leverages
the NTK to distill a large dataset into a small one to retain as much of its information as possible and
the method in [17] uses the NTK to learn black-box generalization attacks against DNN models.

3 Method
In this section, we present BiDirectional learning for offline Infinite-width model-based optimiza-
tion (BDI). First, to mitigate the out-of-distribution problem, we introduce bidirectional learning
between the high-scoring designs and the static dataset in Section 3.1. For a finite-width DNN model,
the loss function is intractable. Approximate solutions lead to significantly poorer designs. We
therefore propose the adoption of an infinite-width DNN model, and leverage its NTK to yield a
closed-form loss and more accurate updates in Section 3.2.

3.1 Bidirectional Learning

We use Xh ∈ RM×D to represent the high-scoring designs, where M represents the number of
high-scoring designs. The high-scoring designs Xh are potentially outside of the static dataset
distribution. To mitigate the out-of-distribution problem, we propose bidirectional learning between
the designs and the static dataset, which leverages one to predict the other and vice versa. This
can distill more information from the static dataset into the high-scoring designs. To compute the
prediction losses, we need to know the scores of the high-scoring designs. Inspired by [10], in which
a reward (score) is predefined for reinforcement learning, we set a predefined target score yh for
every high-scoring design. More precisely, we normalize the scores of the static dataset to have unit
Gaussian statistics following [8] and then choose yh to be larger than the maximal score in the static
dataset. We set yh as 10 across all tasks but demonstrate that BDI is robust to different choices in
Section 4.6. Thus, the high-scoring designs can be written as (Xh , yh ), and the goal is to find Xh .
Forward mapping. Similar to traditional gradient ascent methods, the forward mapping leverages
the static dataset to predict the scores of the high-scoring designs. To be specific, the DNN model
fθl (·) trained on (Xl , yl ) is encouraged to predict the scores yh given Xh . Thus the forward loss

3
function can be written as:
1
Ll2h (Xh ) = ∥yh − fθl ∗ (Xh )∥2 , (4)
M
where θ ∗ is given by
1 β
θ ∗ = arg min ∥yl − fθl (Xl )∥2 + ∥θ∥2 , (5)
θ N N
where β>0 is a fixed regularization parameter. Then we can minimize Ll2h (Xh ) against Xh to
update the high-scoring designs. Ll2h (Xh ) is similar to the gradient ascent process in Eq.(3); the only
difference is that gradient ascent aims to gradually increase the score prediction whereas minimizing
Ll2h (Xh ) pushes the score prediction towards the (high) predefined target score yh .
Backward mapping. Similarly, the DNN model fθh (·) trained on (Xh , yh ) should be able to predict
the scores yl given Xl . The backward loss function is
1
Lh2l (Xh ) = ∥yl − fθh∗ (Xh ) (Xl )∥2 , (6)
N
where θ ∗ (Xh ) is given by
1 β
θ ∗ (Xh ) = arg min ∥yh − fθh (Xh )∥2 + ∥θ∥2 . (7)
θ M M

Overall loss. The forward mapping and the backward mapping together align the high-scoring
designs with the static dataset, and the BDI loss can be compactly written as:
L(Xh ) = Ll2h (Xh ) + Lh2l (Xh ) , (8)
where we aim to optimize the high-scoring designs Xh .

3.2 Closed-form Solver via Neural Tangent Kernel

For a finite-width DNN model fθh (·), the high-scoring designs Xh of Eq.(6) only exist in θ ∗ (Xh )
from Eq.(7), which is intractable and does not have a closed-form solution. The solution of Eq.(7) is
often approximately obtained by a gradient descent step [18, 19],
η ∂∥yh − fθh (Xh )∥2
θ ∗ (Xh ) = θ − , (9)
M ∂θ
where η denotes the learning rate. This approximate solution θ ∗ (Xh ) leads to inaccurate updates to
the designs Xh through Eq.(6) and leads to much poorer designs, especially for high-dimensional
settings, as we illustrate in Section 4.5.
We thus adopt a DNN model fθ (·) with infinite width, and propose to leverage the corresponding
NTK to produce a closed-form θ ∗ (Xh ) [11, 12] and then a closed-form loss function of Eq.(6).
Neural tangent kernel. Denote by k(xi , xj ) the kernel function between xi and xj introduced by
the infinite-width DNN fθ (·). We use KXl Xl ∈ RN×N , KXh Xh ∈ RM×M , KXl Xh ∈ RN×M
and KXh Xl ∈ RM×N to represent the corresponding covariance matrices induced by the kernel
function k(xi , xj ). By leveraging the closed-form θ ∗ (Xh ) given by [11, 12], we can compute the
closed-form Lh2l (Xh ) in Eq.(6) as:
1
Lh2l (Xh ) = ∥yl − KXl Xh (KXh Xh + βI)−1 yh ∥2 . (10)
N

Similar to [20], to focus on higher-scoring designs in the static dataset, we assign a weight to every
design based on its score and recompute the loss:
Lh2l (Xh ) = ∥ωl · (yl − KXl Xh (KXh Xh + βI)−1 yh )∥2 , (11)
p
where ωl = softmax(αyl ) represents the design weight vector and the weight parameter α≥0 is a
constant. Similarly, we have the forward loss function,
Ll2h (Xh ) = ∥ωh · (yh − KXh Xl (KXl Xl + βI)−1 yl )∥2 , (12)

4
where ωh = √1 since every element in yh is the same in our paper.
M

Besides the advantage of the closed-form computation, introducing the NTK into the bidirectional
learning can also avoid the expensive high-order derivative computations brought by Eq.(9), which
makes our BDI an efficient first-order optimization method.
Analysis of M=1. When we aim for one high-scoring design (M = 1), Eq.(11) has a simpler form:
Lh2l (xh ) = ∥ωl · (yl − kXl xh (kxh xh + β)−1 yh )∥2 ,
N
X
2 k(Xli , xh ) (13)
= ωli (yli − yh )2 ,
i=1
kxh xh + β

where k(Xli , xh ) measures the similarity between the ith design of Xl and the high-scoring design
xh . Recall that yh represents the predefined target score and is generally much larger than yli . By
minimizing Lh2l (xh ), we aim to find a high-scoring design xh such that any design in the static
dataset similar to the high scoring design is encouraged to have a high score prediction. In this way,
xh tries to incorporate as many high-scoring features from the static dataset as possible.
Optimization. Combining the two losses, the overall loss can be expressed as:
1
L(Xh ) = (∥ωh · (yh − KXh Xl (KXl Xl + βI)−1 yl )∥2
2 (14)
+ ∥ωl · (yl − KXl Xh (KXh Xh + βI)−1 yh )∥2 ) ,
where only the high-scoring designs Xh are learnable parameters in the whole procedure of BDI. We
optimize Xh against L(Xh ) by gradient descent methods such as Adam [21].

4 Experiments
We conduct extensive experiments on design-bench [1], and aim to answer three research questions:
(1) How does BDI compare with both recently proposed offline model-based algorithms and traditional
algorithms? (2) Is every component necessary in BDI? (i.e., Lh2l (Xh ), Ll2h (Xh ), NTK.) (3) Is BDI
robust to the hyperparameter choices including the predefined target score yh , the weight parameter
α and the number of steps T?

4.1 Dataset and Evaluation

To evaluate the effectiveness of BDI, we adopt the commonly used design-bench, including both
continuous and discrete tasks, and the evaluation protocol as in the prior work [8].
Task overview. We conduct experiments on four continuous tasks: (a) Superconductor (SuperC) [2],
where the aim is to design a superconductor with 86 continuous components to maximize critical
temperature with 17010 designs; (b) Ant Morphology (Ant) [1, 22], where the goal is to design
the morphology of a quadrupedal ant with 60 continuous components to crawl quickly with 10004
designs, (c) D’Kitty Morphology (D’Kitty) [1, 23], where the goal is to design the morphology of a
quadrupedal D’Kitty with 56 continuous components to crawl quickly with 10004 designs; and (d)
Hopper Controller (Hopper) [1], where the aim is to find a neural network policy parameterized by
5126 total weights to maximize return with 3200 designs. Besides the continuous tasks, we perform
experiments on three discrete tasks: (e) GFP [3], where the objective is to find a length 238 protein
sequence to maximize fluorescence with 5000 designs; (f) TF Bind 8 (TFB) [5], where the aim is
to find a length 8 DNA sequence to maximize binding activity score with 32896 designs; and (g)
UTR [6], where the goal is to find a length 50 DNA sequence to maximize expression level with
140, 000 designs. These tasks contain neither personally identifiable nor offensive information.
Evaluation. We follow the same evaluation protocol in [8]: we choose the top N = 128 most
promising designs for each method, and then report the 100th percentile normalized ground truth
y−ymin
score as yn = ymax −ymin where ymin and ymax represent the lowest score and the highest score in
the full unobserved dataset, respectively. The additional 50th percentile (median) normalized ground
truth scores, used in the prior work [8], are also provided in Appendix A.1. To better measure the
performance across multiple tasks, we further report the mean and median ranks of the compared
algorithms over all 7 tasks.

5
4.2 Comparison Methods

We compare BDI with two groups of baselines: (i) sampling via a generative model; and (ii) gradient
updating from existing designs. The generative model based methods learn a distribution of the
high-scoring designs and then sample from the learned distribution. This group of methods includes:
1) CbAS [24]: trains a VAE model of the design distribution using designs with a score above a
threshold and gradually adapts the distribution to the high-scoring part by increasing the threshold. 2)
Auto.CbAS [25]: uses importance sampling to retrain a regression model using the design distribution
introduced by CbAS; 3) MIN [20]: learns an inverse map from the score to the design and queries the
inverse map by searching for the optimal score to obtain the optimal design. The latter group includes:
1) Grad: applies simple gradient ascent on existing designs to obtain new designs; 2) ROMA [7]:
regularizes the smoothness of the DNN and then performs gradient ascent to obtain new designs; 3)
COMs [8]: regularizes the DNN to assign lower scores to designs obtained during the gradient ascent
process and leverages this model to obtain new designs by gradient ascent; 4) NEMO [9]: bounds the
distance between the proxy and the objective function by normalized maximum likelihood and then
performs gradient ascent. We discuss the NTK based Grad in Section 4.5.
Besides the recent work, we also compare BDI with several traditional methods described in [1]:
1) BO-qEI [26]: performs Bayesian Optimization to maximize the proxy, proposes designs via the
quasi-Expected-Improvement acquisition function, and labels the designs via the proxy function; 2)
CMA-ES [27]: an evolutionary algorithm gradually adapts the distribution via the covariance matrix
towards the optimal design; 3) REINFORCE [28]: first learns a proxy function and then optimizes
the distribution over the input space by leveraging the proxy and the policy-gradient estimator.

4.3 Training Details

We follow the training settings of [8] for all comparison methods if not specified. Recall that M
represents the number of the high-scoring designs. We set M = 1 in our experiments, which achieves
expressive results, and discuss the M >1 scenario in Appendix A.2. We set the regularization β as
1e−6 following [16]. We set the predefined target score yh as a constant 10 across all tasks, which
proves to be effective and robust in all cases. We set α as 1e−3 for all continuous tasks and as 0.0 for
all discrete tasks, and set the number of iterations T to 200 in all experiments. We discuss the choices
of yh , α and T in Sec. 4.6 in detail. We adopt a 6-layer MLP (MultiLayer Perceptron) followed by
ReLU for all gradient updating methods and the hidden size is set as 2048. We optimize L(Xh ) with
the Adam optimizer [21] with a 1e−1 learning rate for discrete tasks and a 1e−3 learning rate for
continuous tasks. We cite the results from [8] for the non-gradient-ascent methods including BO-qEI,
CMA-ES2 , REINFORCE, CbAS, Auto.CbAS. For other methods, we run every setting over 8 trials
and report the mean and the standard error. We use the NTK library [29] build on Jax [30] to conduct
kernel-based experiments and use Pytorch [31] for other experiments. All Jax experiments are run on
multiple CPUs within a cluster and all Pytorch experiments are run on one V100 GPU. We discuss
the computational time details for all tasks in Appendix A.3.

4.4 Results and Analysis

We report experimental results in Table 1 for continuous tasks and in Table 2 for discrete tasks where
D(best) represents the maximal score of the static dataset for each task. We bold the results within
one standard deviation of the highest performance.
Results on continuous tasks. As shown in Table 1, BDI achieves the best results over all tasks.
Compared with the naive Grad method, BDI achieves consistent gain over all four tasks, which
suggests that BDI can align the high-scoring design with the static dataset and thus mitigate the
out-of-distribution problem. As for COMs, ROMA and NEMO, which all impose a prior on the
DNN model, they generally perform better than Grad but worse than BDI. This indicates that directly
transforming the information of the static dataset into the high-scoring design is more effective than
adding priors on the DNN in terms of mitigating the out-of-distribution problem. Another advantage
of BDI is in its neural tangent kernel nature: 1) Real-world offline model-based optimization often
involves small-scale datasets since the labeling cost of protein/DNA/robot is very high; 2) BDI is
built on the neural tangent kernel, which generally performs better than finite neural networks on
2
CMA-ES performs very differently on Ant Morphology and D’Kitty Morphology and the reason is that Ant
Morphology is simpler and much more sensitive to initializations.

6
Table 1: Experimental results on continuous tasks for comparison.
Method Superconductor Ant Morphology D’Kitty Morphology Hopper Controller
D(best) 0.399 0.565 0.884 1.0
BO-qEI 0.402 ± 0.034 0.819 ± 0.000 0.896 ± 0.000 0.550 ± 0.118
CMA-ES 0.465 ± 0.024 1.214 ± 0.732 0.724 ± 0.001 0.604 ± 0.215
REINFORCE 0.481 ± 0.013 0.266 ± 0.032 0.562 ± 0.196 −0.020 ± 0.067
CbAS 0.503 ± 0.069 0.876 ± 0.031 0.892 ± 0.008 0.141 ± 0.012
Auto.CbAS 0.421 ± 0.045 0.882 ± 0.045 0.906 ± 0.006 0.137 ± 0.005
MIN 0.452 ± 0.026 0.908 ± 0.020 0.942 ± 0.010 0.094 ± 0.113
Grad 0.483 ± 0.027 0.764 ± 0.047 0.888 ± 0.035 0.989 ± 0.362
COMs 0.487 ± 0.023 0.866 ± 0.050 0.835 ± 0.039 1.224 ± 0.555
ROMA 0.476 ± 0.024 0.814 ± 0.051 0.905 ± 0.018 1.849 ± 0.110
NEMO 0.488 ± 0.034 0.814 ± 0.043 0.924 ± 0.012 1.959 ± 0.159
BDI(ours) 0.520 ± 0.005 0.962 ± 0.000 0.941 ± 0.000 1.989 ± 0.050

Table 2: Experimental results on discrete tasks & ranking on all tasks for comparison.
Method GFP TF Bind 8 UTR Rank Mean Rank Median
D(best) 0.789 0.439 0.593
BO-qEI 0.254 ± 0.352 0.798 ± 0.083 0.684 ± 0.000 8.6/11 9/11
CMA-ES 0.054 ± 0.002 0.953 ± 0.022 0.707 ± 0.014 5.7/11 6/11
REINFORCE 0.865 ± 0.000 0.948 ± 0.028 0.688 ± 0.010 7.4/11 9/11
CbAS 0.865 ± 0.000 0.927 ± 0.051 0.694 ± 0.010 4.6/11 5/11
Auto.CbAS 0.865 ± 0.000 0.910 ± 0.044 0.691 ± 0.012 6.0/11 7/11
MIN 0.865 ± 0.001 0.882 ± 0.020 0.694 ± 0.017 5.3/11 4/11
Grad 0.864 ± 0.001 0.491 ± 0.042 0.666 ± 0.013 7.9/11 8/11
COMs 0.861 ± 0.009 0.920 ± 0.043 0.699 ± 0.011 5.6/11 6/11
ROMA 0.558 ± 0.395 0.928 ± 0.038 0.690 ± 0.012 6.1/11 7/11
NEMO 0.150 ± 0.270 0.905 ± 0.048 0.694 ± 0.015 5.4/11 4/11
BDI(ours) 0.864 ± 0.000 0.973 ± 0.000 0.760 ± 0.000 1.9/11 1/11

small-scale datasets [32]; 3) COMs, ROMA, and NEMO are built on a finite neural network and it is
non-trivial to modify them to the neural tangent kernel version. The generative model based methods
CbAS, Auto.CbAS and MIN perform poorly for high-dimensional tasks like Hopper (D=5126) since
the high-dimensional data distributions are harder to model. Compared with generative model based
methods, BDI not only achieves better results but is also much simpler.
Results on discrete tasks. As shown in Table 2, BDI achieves the best or equal-best performances in
two of the three tasks, and is only marginally inferior in the third. This suggests that BDI is also a
powerful approach in the discrete domain. For the GFP task, every design is a length 238 sequence
of 20-categorical one-hot vectors and gradient updates that do not take into account the sequential
nature may be less beneficial. This may explain why BDI does not perform as well on GFP.
Overall, BDI achieves the best result in terms of the ranking as shown in Table 2 and Figure 2, and
attains the best performance on 6/7 tasks.

Table 3: Ablation studies on BDI components.


Task D BDI w/o Ll2h w/o Lh2l NTK2NN NTK2RBF
GFP 238 0.864 ± 0.000 0.860 ± 0.004 0.864 ± 0.000 0.252 ± 0.354 0.862 ± 0.003
TFB 8 0.973 ± 0.000 0.984 ± 0.000 0.973 ± 0.000 0.958 ± 0.025 0.925 ± 0.000
UTR 50 0.760 ± 0.000 0.636 ± 0.000 0.777 ± 0.000 0.699 ± 0.009 0.738 ± 0.000
SuperC 86 0.520 ± 0.005 0.511 ± 0.007 0.502 ± 0.007 0.450 ± 0.036 0.510 ± 0.012
Ant 60 0.962 ± 0.000 0.914 ± 0.000 0.933 ± 0.000 0.861 ± 0.019 0.927 ± 0.000
D’Kitty 56 0.941 ± 0.000 0.920 ± 0.000 0.942 ± 0.001 0.929 ± 0.010 0.938 ± 0.008
Hopper 5126 1.989 ± 0.050 1.935 ± 0.553 1.821 ± 0.223 0.503 ± 0.039 0.965 ± 0.103

7
BO-qEI GFP Ant
1.0
CMA-ES TFB Dkitty
REINFORCE 1.2 UTR Hopper 0.9
CbAS SupC

Score ratio
Auto.CbAS 1.1 = 0.0 3 = 4e 3
0.8

Score
MIN = 1e 3 = 5e 4
Grad 1.0 = 2e
COMs 0.7
ROMA 0.9
NEMO
BDI(ours)
0.6
1 2 3 4 5 6 7 8 9 10 11 0.8 5 10 15 20 25 30 0 50 100 150 200
Rank Predefined target score yh Step T
Figure 2: Rank minima and max- Figure 3: The ratio of the Figure 4: The ground truth
ima are indicated by whiskers; ver- ground truth score of the de- score of the design as a func-
tical lines and black triangles rep- sign with yh to the ground tion of the step T for different
resent medians and means. truth score with yh = 10. weight parameters α on Ant.

4.5 Ablation Studies

In this subsection, we conduct ablation studies on the components of BDI and aim to verify the
effectiveness of Lh2l (Xh ), Ll2h (Xh ), and NTK. The baseline method is our proposed BDI and we
remove each component to verify its effectiveness. We replace the NTK with its corresponding finite
DNN (6-layer MLP followed by ReLU), which is trained on (Xl , yl ) for Eq.(5) and (Xh , yh ) for
Eq.(7). Following [16], another alternative is to replace the NTK with RBF.
As shown in Table 3, BDI generally performs the best (at least second-best) across tasks, which
indicates the importance of all three components.
Forward mapping. Removing Ll2h (Xh ) yields the best result 0.984 in the TFB task, but worsens
the designs in other cases, which verifies the importance of the forward mapping. Incorporating the
Ll2h (Xh ) loss term is similar to using the Grad method as we discuss in Section 3.1. We also run
experiments on gradient ascent with NTK. This approach first builds a regression model using the
NTK and then performs gradient ascent on existing designs to obtain new designs. Results are very
similar to BDI without Lh2l (Xh ), so we do not report them here.
Backward mapping. Removing Lh2l (Xh ) decreases model performances in three tasks, and only
improves it for the UTR task. This demonstrates the importance of the backward mapping. Although
removing Lh2l leads to the best result 0.777 in UTR compared with 0.760 of BDI, this does not
necessarily demonstrate that the removal of Lh2l is desirable. The 50th percentile of BDI is higher
than that of BDI without Lh2l (0.664 > 0.649). Furthermore, we observe that the backward mapping
is more effective for continuous tasks, possibly because we do not model the sequential nature of
DNA and protein sequences. We follow the default procedure in [8] which maps the sequence design
to real-valued logits of a categorical distribution for a fair comparison. Given a more effective discrete
space modeling, the backward mapping might yield better results; we leave this to future work.
Neural tangent kernel. We replace the NTK with the corresponding finite DNN (NTK2NN) and
the RBF kernel (NTK2RBF), and both replacements degrade the performance significantly, which
verifies the effectiveness of the NTK. In addition, we can observe that NTK2NN performs relatively
well on the low-dimensional task TFB (D=8) but poorly on high-dimensional tasks, especially for
GFP (D = 238) and Hopper (D = 5126). It is likely that the high-dimensional designs Xh are more
sensitive to the approximation in Eq. (9). Furthermore, BDI performs better than NTK2RBF and this
can be explained as follows. Compared with the RBF kernel, the NTK enjoys the high expressiveness
of a DNN [15] and represents a broad class of DNNs [16, 17], which can better extract the design
features and enhance the generalization of the high-scoring designs.

4.6 Hyperparameter Sensitivity

In this subsection, we first study the sensitivity of BDI to the predefined target score yh . Note that
all algorithms in this subsection do not have access to the ground truth scores, which are only used
for analysis. We expect this score to be larger than the maximal score of the static dataset since we
want to obtain designs better than those in the static dataset. We first normalize all scores to have unit
Gaussian statistics following [1, 8]. A small target score may not be able to identify a high-scoring
design and a large score is too hard to reach. Thus we choose to set the target score yh as 10 across all

8
Table 4: Comparison between data distillation and backward mapping.
Comparison Data distillation Backward mapping
Data x images in the training dataset D designs in the offline dataset D
Label content 0-9 for the MNIST dataset some measurement of protein/dna/robot
Label value within the training dataset D larger than the max of the offline dataset D
Loss function cross-entropy for classification mean squared error for regression
Task objective distill D into a small subset distill x to incorporate high-scoring features

tasks. In Figure 3, we report the ratio of the ground truth score of the design with yh to the ground
truth score of the design with yh = 10 and the score details are in Appendix A.4. All tasks are robust
to the change of yh . Although we set yh as 10, which is smaller than the maximal score 10.2 for
Hopper, BDI still yields a good result, which demonstrates the robustness of BDI. For Hopper, after
we set yh to a larger value, we can observe that the corresponding results become better.
Besides yh , we also explore robustness to the choice of the weight parameter α and the number
of steps T. We study the continuous task Ant and the discrete task TFB. As shown in Figure 4,
we visualize the 100th percentile ground truth score of the Ant task as a function of T for α =
0.0, 5e−4 , 1e−3 , 2e−3 , 4e−3 . We can observe that the behavior of BDI is robust to change of α within
the range of evaluated values. BDI yields the high-scoring designs at early steps and then the solutions
remain stable. This demonstrates the robustness of BDI to the choice of T. For the TFB task, we find
the behavior of BDI is identical for different α and the behavior with respect to T is very similar to
that of Ant. Thus we do not report details here. One possible reason for the insensitivity to α is that
TFB is a discrete task and the discrete solution is more robust to α.

5 Related Work

Offline model-based optimization. Recent methods for offline model-based optimization can be
divided into two categories: (i) obtaining designs from a generative model and (ii) applying gradient
ascent on existing designs. In the first category, the methods in [24, 25] gradually adapt a VAE-based
generative model towards the optimized design by leveraging a proxy function. Kumar et al. adopt an
alternative approach in [20], learning a generator that maps the score to the design. The generative
model based methods often require careful tuning to model the high-scoring designs [8].
Gradient-based methods have received substantial attention recently because they can leverage DNN
models to produce improved designs. A challenge in the naive application of such methods is
out-of-distribution designs, for which the trained DNN model produces inaccurate score predictions.
Several approaches have been proposed to address this. Trabucco et al. add a regularizing term for
conservative objective modeling [8, 33]. In [9], Fu et al. leverage normalized maximum likelihood to
bound the distance between the proxy function and the ground truth. Yu et al. overcome the brittleness
of the proxy function by utilizing a local smoothness prior [7]. Our work falls into the gradient-based
line of research, but rather than imposing a prior on the model, we propose the bidirectional mapping
to ensure that the proposed designs can be used to predict the scores of the data and vice versa,
thus ensuring that the high-scoring designs are more aligned with the static dataset. Similar to our
backward mapping constraining designs, BCQ [34] constrains the state-action pairs to be contained
in the support of the static dataset. Our proposed backward mapping is different in its implementation
and also enables the model to generate designs outside the training distribution.
Data distillation. Data distillation [18, 35–38] aims to distill a large training set into a small one,
which enables efficient retraining. Building upon [18], the methods in [39, 40] propose to distill
labels instead of images. Approaches in [16, 41] introduce a meta-learning algorithm powered by
NTK, which significantly improves the performance. The backward mapping of BDI is inspired by
[16, 41] and the difference is that BDI is handling a regression problem with a predefined target score
larger than the maximal score of the static dataset. In Table 4 we provided a detailed comparison
of backward mapping and data distillation; both adopt a bi-level framework [42]. This bi-level
framework can also be used to learn other hyperaparameters [17, 43–47].

9
6 Conclusion and discussion
In this paper, we address offline model-based optimization and propose a novel approach that involves
bidirectional score predictions. BDI requires the proposed high-scoring designs and the static offline
dataset to predict each other, which effectively ameliorates the out-of-distribution problem. The finite-
width DNN can only yield an approximate loss function, which leads to a significant deterioration
of the design quality, especially for high-dimensional tasks. We thus adopt an infinite-width DNN
and employ the NTK to yield a closed-form solution, leading to better designs. Experimental results
verify the effectiveness of BDI, demonstrating that it outperforms state-of-the-art approaches.
Limitations. Although BDI has been demonstrated to be effective in tasks from a wide range
of domains, some evaluations are not performed with a realistic setup. For example, in some of
our experiments, such as the GFP [3] task, we follow the prior work and use a transformer model
which is pre-trained on a large dataset as the evaluation oracle. However, this may not reflect real-
world behavior and the evaluation protocol can be further improved by close collaboration with
domain experts [48–50]. Overall, we still believe BDI can generalize well to these cases since BDI
has a simple formulation and proves to be effective and robust for the wide range of tasks in the
design-bench.
Negative impacts. Designing a new object or entity with desired properties is a double-edged sword.
On one hand, it can be beneficial to society, with an example being drug discovery to cure unsolved
diseases. On the other hand, if these techniques are acquired by dangerous people, they can also be
used to design biochemicals to harm our society. Researchers should pay vigilant attention to ensure
their research does end up being used positively for the social good.

7 Acknowledgement
We thank Aristide Baratin from Mila and Greg Yang from Microsoft Research for their helpful
discussions on the neural tangent kernel and thank Zixuan Liu from the University of Washington for
his helpful suggestions on the paper writing. This research was supported in part by funding from the
Fonds de recherche du Québec – Nature et technologies.

References
[1] Brandon Trabucco, Xinyang Geng, Aviral Kumar, and Sergey Levine. Design-bench: Bench-
marks for data-driven offline model-based optimization. arXiv preprint arXiv:2202.08450,
2022.

[2] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a
superconductor. Computational Materials Science, 2018.

[3] Karen S Sarkisyan et al. Local fitness landscape of the green fluorescent protein. Nature, 2016.

[4] Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and
Lucy Colwell. Model-based reinforcement learning for biological sequence design. In Proc. Int.
Conf. Learning Rep. (ICLR), 2019.

[5] Luis A Barrera et al. Survey of variation in human transcription factors reveals prevalent dna
binding changes. Science, 2016.

[6] Paul J Sample, Ban Wang, David W Reid, Vlad Presnyak, Iain J McFadyen, David R Morris,
and Georg Seelig. Human 5 UTR design and variant effect prediction from a massively parallel
translation assay. Nature Biotechnology, 2019.

[7] Sihyun Yu, Sungsoo Ahn, Le Song, and Jinwoo Shin. Roma: Robust model adaptation for
offline model-based optimization. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2021.

[8] Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective
models for effective offline model-based optimization. In Proc. Int. Conf. Learning Rep. (ICLR),
2021.

10
[9] Justin Fu and Sergey Levine. Offline model-based optimization via normalized maximum
likelihood estimation. Proc. Int. Conf. Learning Rep. (ICLR), 2021.
[10] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter
Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning
via sequence modeling. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2021.
[11] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2018.
[12] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-
Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models
under gradient descent. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2019.
[13] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and
Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. Proc. Int. Conf. Learning
Rep. (ICLR), 2017.
[14] Greg Yang and Etai Littwin. Tensor programs iib: Architectural universality of neural tangent
kernel training dynamics. In Proc. Int. Conf. Machine Lea. (ICML), 2021.
[15] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak,
and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Proc.
Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[16] Timothy Nguyen, Zhourong Chen, and Jaehoon Lee. Dataset meta-learning from kernel
ridge-regression. Proc. Int. Conf. Learning Rep. (ICLR), 2020.
[17] Chia-Hung Yuan and Shan-Hung Wu. Neural tangent generalization attacks. In Proc. Int. Conf.
Machine Lea. (ICML), 2021.
[18] Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation.
arXiv preprint arXiv:1811.10959, 2018.
[19] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil.
Bilevel programming for hyperparameter optimization and meta-learning. In Proc. Int. Conf.
Machine Lea. (ICML), 2018.
[20] Aviral Kumar and Sergey Levine. Model inversion networks for model-based optimization.
Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proc. Int.
Conf. Learning Rep. (ICLR), 2015.
[22] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang,
and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
[23] Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine,
and Vikash Kumar. Robel: Robotics benchmarks for learning with low-cost robots. In Conf. on
Robot Lea. (CoRL), 2020.
[24] David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling
for robust design. In Proc. Int. Conf. Machine Lea. (ICML), 2019.
[25] Clara Fannjiang and Jennifer Listgarten. Autofocused oracles for model-based design. Proc.
Adv. Neur. Inf. Proc. Syst (NeurIPS), 2020.
[26] James T Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. The reparame-
terization trick for acquisition functions. arXiv preprint arXiv:1712.00424, 2017.
[27] Nikolaus Hansen. The CMA evolution strategy: A comparing review. In Jose A. Lozano,
Pedro Larrañaga, Iñaki Inza, and Endika Bengoetxea, editors, Towards a New Evolutionary
Computation: Advances in the Estimation of Distribution Algorithms, pages 75–102. Springer
Berlin Heidelberg, Berlin, Heidelberg, 2006.

11
[28] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning, 1992.
[29] Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-
Dickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks
in python. In Proc. Int. Conf. Learning Rep. (ICLR), 2020.
[30] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal
Maclaurin, and Skye Wanderman-Milne. Jax: composable transformations of python+ numpy
programs. URL http://github.com/google/jax, 2020.
[31] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative
style, high-performance deep learning library. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2019.
[32] Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, and Dingli
Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. In Proc. Int. Conf.
Machine Lea. (ICML), 2020.
[33] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate
once: Accelerating adversarial training via maximal principle. Proc. Adv. Neur. Inf. Proc. Syst
(NeurIPS), 2019.
[34] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning
without exploration. In Proc. Int. Conf. Machine Lea. (ICML), 2019.
[35] Yongchao Zhou, Ehsan Nezhadarya, and Jimmy Ba. Dataset distillation using neural feature
regression. arXiv preprint arXiv:2206.00719, 2022.
[36] Justin Cui, Ruochen Wang, Si Si, and Cho-Jui Hsieh. DC-BENCH: Dataset condensation
benchmark. arXiv preprint arXiv:2207.09639, 2022.
[37] Noveen Sachdeva, Mehak Preet Dhaliwal, Carole-Jean Wu, and Julian McAuley. Infinite
recommendation networks: A data-centric approach. arXiv preprint arXiv:2206.02626, 2022.
[38] Songhua Liu, Kai Wang, Xingyi Yang, Jingwen Ye, and Xinchao Wang. Dataset distillation via
factorization. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2022.
[39] Ondrej Bohdal, Yongxin Yang, and Timothy Hospedales. Flexible dataset distillation: Learn
labels instead of images. arXiv preprint arXiv:2006.08572, 2020.
[40] Ilia Sucholutsky and Matthias Schonlau. Soft-label dataset distillation and text dataset distilla-
tion. In Proc. Int. Joint Conf. on Neur. Net. (IJCNN), 2021.
[41] Timothy Nguyen, Roman Novak, Lechao Xiao, and Jaehoon Lee. Dataset distillation with
infinitely wide convolutional networks. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS), 2021.
[42] Can Chen, Xi Chen, Chen Ma, Zixuan Liu, and Xue Liu. Gradient-based bi-level optimization
for deep learning: A survey. arXiv preprint arXiv:2207.11719, 2022.
[43] Zhiting Hu, Bowen Tan, Russ R Salakhutdinov, Tom M Mitchell, and Eric P Xing. Learning
data manipulation for augmentation and weighting. Proc. Adv. Neur. Inf. Proc. Syst (NeurIPS),
2019.
[44] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples
for robust deep learning. In Proc. Int. Conf. Machine Lea. (ICML), 2018.
[45] Can Chen, Chen Ma, Xi Chen, Sirui Song, Hao Liu, and Xue Liu. Unbiased implicit feedback
via bi-level optimization. arXiv preprint arXiv:2206.00147, 2022.
[46] Can Chen, Shuhao Zheng, Xi Chen, Erqun Dong, Xue Steve Liu, Hao Liu, and Dejing Dou.
Generalized data weighting via class-level gradient manipulation. Proc. Adv. Neur. Inf. Proc.
Syst (NeurIPS), 2021.

12
[47] Can Chen, Jingbo Zhou, Fan Wang, Xue Liu, and Dejing Dou. Structure-aware protein self-
supervised learning. arXiv preprint arXiv:2204.04213, 2022.
[48] Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao, Yusuf Roohani, Jure Leskovec, Connor W
Coley, Cao Xiao, Jimeng Sun, and Marinka Zitnik. Artificial intelligence foundation for
therapeutic science. Nature Chemical Biology, 2022.
[49] Changlin Wan. Machine Learning Approaches to Reveal Discrete Signals in Gene Expression.
PhD thesis, Purdue University Graduate School, 2022.
[50] Xiao Luo, Xinming Tu, Yang Ding, Ge Gao, and Minghua Deng. Expectation pooling: an
effective and interpretable pooling method for predicting dna–protein binding. Bioinformatics,
2020.
[51] Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. Dnabert: pre-trained bidirectional
encoder representations from transformers model for dna-language in genome. Bioinformatics,
2021.
[52] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones,
Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. ProtTrans: towards
cracking the language of life’s code through self-supervised deep learning and high performance
computing. arXiv preprint arXiv:2007.06225, 2020.
[53] Courtney E Gonzalez and Marc Ostermeier. Pervasive pairwise intragenic epistasis among
sequential mutations in tem-1 β-lactamase. Journal of molecular biology, 2019.
[54] Elad Firnberg, Jason W Labonte, Jeffrey J Gray, and Marc Ostermeier. A comprehensive,
high-resolution map of a gene’s fitness landscape. Molecular biology and evolution, 2014.
[55] Dinghuai Zhang, Jie Fu, Yoshua Bengio, and Aaron Courville. Unifying likelihood-free
inference with black-box optimization and beyond. In Proc. Int. Conf. Learning Rep. (ICLR),
2021.
[56] Emily E Wrenbeck, Laura R Azouz, and Timothy A Whitehead. Single-mutation fitness
landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nature
Communications, 2017.
[57] Justin R Klesmith, John-Paul Bacik, Ryszard Michalczyk, and Timothy A Whitehead. Compre-
hensive sequence-flux mapping of a levoglucosan utilization pathway in e. coli. ACS Synthetic
Biology, 2015.
[58] Jochen Weile, Song Sun, Atina G Cote, Jennifer Knapp, Marta Verby, Joseph C Mellor,
Yingzhou Wu, Carles Pons, Cassandra Wong, Natascha van Lieshout, et al. A framework for
exhaustively mapping functional missense variants. Molecular Systems Biology, 2017.
[59] Zuobai Zhang, Minghao Xu, Arian Jamasb, Vijil Chenthamarakshan, Aurelie Lozano, Payel
Das, and Jian Tang. Protein representation learning by geometric structure pretraining. arXiv
preprint arXiv:2203.06125, 2022.

Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] We discuss limitations in Sec-
tion 6.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] We discuss
negative impacts in Section 6.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...

13
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experi-
mental results (either in the supplemental material or as a URL)? [Yes] They are in the
supplemental material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] All of these can in be found in Section 4.
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] We run every setting 8 times and report the mean and
standard deviation in the tables.
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] It is in Section 4.3.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We cite them [1].
(b) Did you mention the license of the assets? [Yes] We mention the license in the
README.md of our code in the supplemental material.
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
We include our code in the supplemental material as a new asset and we will release
this new asset upon paper acceptance.
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [Yes] The assets are publicly available and free for research use.
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] We use design-bench, which does not contain
personally identifiable information nor offensive content, as illustrated in Section 4.1.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]

14

You might also like