(AutoCbAS) Autofocused Oracles for Model-Based Design
1 Oracle-based design
The design of objects with desired properties, such as novel proteins, molecules, or materials, has
a rich history in bioengineering, chemistry, and materials science. In these domains, design has
historically been performed through iterative, labor-intensive experimentation [1] (e.g., measuring
protein binding affinity) or compute-intensive physics simulations [2] (e.g., computing low-energy
structures for nanomaterials). Increasingly, however, attempts are being made to replace these costly
and time-consuming steps with cheap and fast calls to a proxy regression model, trained on labeled
data [3, 4, 5, 6, 7]. Herein, we refer to such a proxy model as an oracle, and assume that acquisition
of training data for the oracle is complete, as in [5, 6, 7, 8, 9].1 The key issue addressed by our work
is how best to train an oracle for use in design, given fixed training data.
In contrast to the traditional use of predictive models, oracle-based design is distinguished by the fact
that it seeks solutions—and therefore, will query the oracle—in regions of the design space that are
not well-represented by the oracle training data. If this is not the case, the design problem is easy in
that the solution is within the region of the training data. Furthermore, one does not know beforehand
which parts of the design space a design procedure will navigate through. As such, a major challenge
arises when an oracle is employed for design: its outputs, including its uncertainty estimates, become
unreliable beyond the training data [10, 11]. Successful oracle-based design thus involves an inherent
trade-off between the need to stay “near” the training data in order to trust the oracle, and the need to
depart from it in order to make improvements. While trust region approaches have been developed
¹ For many applications in protein, molecule, and material design, even if one performs iterative rounds of data acquisition, at some point the acquisition phase concludes due to finite resources.
to help address this trade-off [11, 7], herein, we take a different approach and ask: what is the most
effective way to use a fixed, labeled dataset to train an oracle for design?
Contributions. We develop a novel approach to oracle-based design that specifies how to update
the oracle as the design space is explored—what we call autofocusing the oracle. In particular, we (i)
formalize oracle-based design as a non-zero-sum game, (ii) derive an oracle-updating strategy for
seeking a Nash equilibrium, and (iii) demonstrate empirically that autofocusing holds promise for
improving oracle-based design.
Design problems can be cast as seeking points in the design space, x ∈ X , that with high probability
satisfy desired conditions on a property random variable, y ∈ R. For example, one might want
to design a superconducting material by specifying its chemical composition, x, such that the
resulting material has critical temperature greater than some threshold, y ≥ yτ , or has maximal
critical temperature, y = ymax . We specify the desired properties using a constraint set, S, such
as S = {y : y ≥ yτ} for some yτ. The design goal is then to solve arg max_x P(y ∈ S | x). This optimization problem over the inputs, x, can be converted to one over distributions over the design space [12, 13]. Specifically, model-based optimization (MBO) seeks the parameters, θ, of a “search model”, pθ(x), that maximize an objective that bounds the original objective:
$$\max_{x} P(y \in S \mid x) \;\geq\; \max_{\theta \in \Theta} \mathbb{E}_{p_\theta(x)}\left[P(y \in S \mid x)\right] \;=\; \max_{\theta \in \Theta} \mathbb{E}_{p_\theta(x)}\left[\int_S p(y \mid x)\,dy\right]. \qquad (1)$$
The original optimization problem over x, and the MBO problem over θ, are equivalent when the
search model has the capacity to place point masses on optima of the original objective. Reasons
for using the MBO formulation include that it requires no gradients of p(y | x), thereby allowing
the use of arbitrary oracles for design, including those that are not differentiable with respect to
the design space and otherwise require specialized treatment [14]. MBO also naturally allows one
to obtain not just a single design candidate, but a diverse set of candidates, by sampling from the
final search distribution (whose entropy can be adjusted by adding regularization to Equation 1).
Finally, MBO introduces the language of probability into the optimization, thereby allowing coherent
incorporation of probabilistic constraints such as implicit trust regions [11]. The search model can
be any parameterized probability distribution that can be sampled from, and whose parameters can
be estimated using weighted maximum likelihood estimation (MLE) or approximations thereof.
Examples include mixtures of Gaussians, hidden Markov models, variational autoencoders [15], and
Potts models [16]. Notably, the search model distribution can be over discrete or continuous random
variables, or a combination thereof.
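For concreteness, the following is a minimal sketch of the weighted MLE update for one such search model, a multivariate Gaussian; the function name and interface are illustrative, not taken from our implementation.

```python
import numpy as np

def weighted_mle_gaussian(samples, weights):
    """Weighted MLE update for a multivariate Gaussian search model.

    samples: (m, d) array of design candidates drawn from the current model.
    weights: (m,) array of non-negative EDA weights, v_i = V(P_beta(y in S | x_i)).
    Returns the updated mean and covariance.
    """
    w = weights / weights.sum()                  # normalize the weights
    mean = w @ samples                           # weighted sample mean
    centered = samples - mean
    cov = (w[:, None] * centered).T @ centered   # weighted sample covariance
    return mean, cov
```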
We use the phrase model-based design (MBD) to denote use of MBO to solve a design problem.
Hereafter, we focus on oracle-based MBD, which attempts to solve Equation 1 by replacing costly
and time-consuming queries of the ground truth², p(y | x), with calls to a trained regression model
(i.e., oracle), pβ (y | x), with parameters, β ∈ B. Given access to a fixed dataset, {(xi , yi )}ni=1 , the
oracle is typically trained once using standard techniques and thereafter considered fixed [5, 6, 7, 8, 9,
11, 17, 18]. In what follows, we describe why such a strategy is sub-optimal and how to re-train the
oracle in order to better achieve design goals. First, however, we briefly review a common approach
for performing MBO, as we will leverage such algorithms in our approach.
MBO problems are often tackled with an Estimation of Distribution Algorithm (EDA) [19, 20], a class
of iterative optimization algorithms that can be seen as Monte Carlo expectation-maximization [13];
EDAs are also connected to the cross-entropy method [21, 22] and reward-weighted regression in
² We refer to the ground truth as the distribution of direct property measurements, which are inevitably stochastic due to sensor noise.
reinforcement learning [23]. Given an oracle, pβ (y | x), and an initial search model, pθ(t=0) , an EDA
typically proceeds at iteration t with two core steps:
1. “E-step”: Sample from the current search model, x̃i ∼ pθ(t−1)(x), for all i ∈ {1, . . . , m}. Compute a weight for each sample, vi := V(Pβ(y ∈ S | x̃i)), where V(·) is a method-specific, monotonic transformation.
2. “M-step”: Perform weighted MLE to yield an updated search model, pθ(t) (x), which tends
to have more mass where Pβ (y ∈ S | x) is high. (Some EDAs can be seen as performing
maximum a posteriori inference instead, which results in smoothed parameter updates [11].)
Upon convergence of the EDA, design candidates can be sampled from the final search model if it
is not a point mass; one may also choose to use promising samples from earlier iterations. Notably,
the oracle, pβ (y | x), remains fixed in the steps above. Next, we motivate a new formalism for
oracle-based MBD that yields a principled approach for updating the oracle at each iteration.
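To make the two steps above concrete, here is a minimal sketch of one EDA iteration with a Gaussian search model and a fixed Gaussian oracle, taking V(·) to be the identity transformation; the interfaces (oracle_mu, oracle_sigma) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def eda_iteration(mean, cov, oracle_mu, oracle_sigma, y_threshold, m=1000, rng=None):
    """One iteration of a generic EDA with a fixed Gaussian oracle and a
    Gaussian search model; V(.) is taken to be the identity."""
    rng = rng or np.random.default_rng()
    # "E-step": sample from the current search model, then weight each sample
    # by P_beta(y in S | x) for S = {y : y >= y_threshold}
    x = rng.multivariate_normal(mean, cov, size=m)
    v = norm.sf(y_threshold, loc=oracle_mu(x), scale=oracle_sigma)
    # "M-step": weighted MLE refit of the search model
    w = v / v.sum()
    new_mean = w @ x
    centered = x - new_mean
    new_cov = (w[:, None] * centered).T @ centered
    return new_mean, new_cov
```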
Substituting the oracle for the ground truth in Equation 1 yields the oracle-based objective,
$$\max_{\theta \in \Theta} \mathbb{E}_{p_\theta(x)}\left[P_\beta(y \in S \mid x)\right], \qquad (2)$$
but the solution to the oracle-based problem will, in general, be sub-optimal with respect to the original
objective that uses the ground truth, P (y ∈ S | x). This sub-optimality can be extreme due to
pathological behavior of the oracle when the search model, pθ (x), strays too far from the training
distribution during the optimization [11].
Since one cannot access the ground truth, we seek a practical alternative wherein we can leverage an
oracle, but also infer when the values of the ground-truth and oracle-based objectives (in Equations 1
and 2, respectively) are likely to be close. To do so, we introduce the notion of the oracle gap, defined
as Epθ (x) [|P (y ∈ S | x) − Pβ (y ∈ S | x)|]. When this quantity is small, then by Jensen’s inequality
the oracle-based and ground-truth objectives are close. Consequently, our insight for improving
oracle-based design is to use the oracle that minimizes the oracle gap,
$$\operatorname*{arg\,min}_{\beta \in \mathcal{B}} \; \textsc{OracleGap}(\theta, \beta) = \operatorname*{arg\,min}_{\beta \in \mathcal{B}} \; \mathbb{E}_{p_\theta(x)}\left[\,\left|P(y \in S \mid x) - P_\beta(y \in S \mid x)\right|\,\right]. \qquad (3)$$
Together, Equations 2 and 3 define the coupled objectives of two players, namely the search model
(with parameters θ) and the oracle (with parameters β), in a non-zero-sum game. To attain good
objective values for both players, our goal will be to search for a Nash equilibrium—that is, a pair of
values (θ∗ , β ∗ ) such that neither can improve its objective given the other. To do so, we develop an
alternating ascent-descent algorithm, which alternates between (i) fixing the oracle parameters and
updating the search model parameters to increase the objective in Equation 2 (the ascent step), and
(ii) fixing the search model parameters and updating the oracle parameters to decrease the objective
in Equation 3 (the descent step). In the next section, we describe this algorithm in more detail.
Practical interpretation of the MBD game. Interpreting the usefulness of this game formulation
requires some subtlety. The claim is not that every Nash equilibrium yields a search model that
provides a high value of the (unknowable) ground-truth objective in Equation 1. However, for any
pair of values, (θ, β), the value of the oracle gap provides a certificate on the value of the ground-truth
objective. In particular, if one has an oracle and search model that yield an oracle gap of ε, then by Jensen's inequality the ground-truth objective is within ε of the oracle-based objective. Therefore,
to the extent that we are able to minimize the oracle gap (Equation 3), we can trust the value of our
oracle-based objective (Equation 2). Note that a small, or even zero oracle gap only implies that the
oracle-based objective is trustworthy; successful design also entails achieving a high oracle-based
objective, the potential for which depends on an appropriate oracle class and suitably informative
training data (as it always does for oracle-based design, regardless of whether our framework is used).
Although the oracle gap as a certificate is useful conceptually for motivating our approach, at present
it is not clear how to estimate it. In our experiments, we found that we could demonstrate the
benefits of autofocusing without directly estimating the oracle gap, relying solely on the principle
of minimizing it. We also note that in practice, what matters is not whether we converge to a Nash equilibrium, just as what matters in empirical risk minimization is not whether one exactly recovers the global optimum, but whether one finds a useful point. That is, if we can find parameters, (θ, β), that yield better designs than alternative methods, then we have developed a useful method.
Our approach alternates between an ascent step that updates the search model, and a descent step
that updates the oracle. The ascent step is relatively straightforward as it leverages existing MBO
algorithms. The descent step, however, requires some creativity. In particular, for the ascent step,
we run a single iteration of an MBO algorithm as described in §2.1, to obtain a search model that
increases the objective in Equation 2. For the descent step, we aim to minimize the oracle gap in
Equation 3 by making use of the following observation (proof in Supplementary Material §S2).
Proposition 1. For any search model, pθ(x), if the oracle parameters, β, satisfy
$$\mathbb{E}_{p_\theta(x)}\left[D_{\mathrm{KL}}(p(y \mid x) \,\|\, p_\beta(y \mid x))\right] = \int_{\mathcal{X}} D_{\mathrm{KL}}(p(y \mid x) \,\|\, p_\beta(y \mid x))\, p_\theta(x)\, dx \;\leq\; \epsilon, \qquad (4)$$
where DKL(p ‖ q) is the Kullback–Leibler (KL) divergence between distributions p and q, then the following bound holds:
$$\mathbb{E}_{p_\theta(x)}\left[\,\left|P(y \in S \mid x) - P_\beta(y \in S \mid x)\right|\,\right] \;\leq\; \sqrt{\epsilon/2}. \qquad (5)$$
As a consequence of Proposition 1, given any search model, pθ (x), an oracle that minimizes the
expected KL divergence in Equation 4 also minimizes an upper bound on the oracle gap. Our
descent strategy is therefore to minimize this expected divergence. In particular, as shown in the
Supplementary Material §S2, the resulting oracle parameter update at iteration t can be written
as β^(t) = arg max_{β∈B} E_{p_{θ^(t)}(x)} E_{p(y|x)}[log pβ(y | x)], where we refer to the objective as the log-likelihood under the search model. Although we cannot generally access the ground truth, p(y | x),
we do have labeled training data, {(xi , yi )}ni=1 , whose labels come from the ground-truth distribution,
yi ∼ p(y | x = xi ). We therefore use importance sampling with the training distribution, p0 (x), as
the proposal distribution, to obtain a now-practical oracle parameter update,
$$\beta^{(t)} = \operatorname*{arg\,max}_{\beta \in \mathcal{B}} \; \frac{1}{n} \sum_{i=1}^{n} \frac{p_{\theta^{(t)}}(x_i)}{p_0(x_i)} \log p_\beta(y_i \mid x_i). \qquad (6)$$
The training points, xi , are used to estimate some model for p0 (x), while pθ(t) (x) is given by the
search model. We discuss the variance of the importance weights, wi := pθ (xi )/p0 (xi ), shortly.
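A minimal sketch of this oracle update (Equation 6) follows; search_logpdf, p0_logpdf, and fit_oracle are illustrative stand-ins for the search model density, the training-distribution model, and any weighted regression fit (e.g., many scikit-learn estimators accept a sample_weight argument).

```python
import numpy as np

def retrain_oracle(X_train, y_train, search_logpdf, p0_logpdf, fit_oracle):
    """Oracle update of Equation 6: weighted MLE on importance-reweighted
    training data."""
    log_w = search_logpdf(X_train) - p0_logpdf(X_train)
    # Rescaling the weights by a positive constant does not change the argmax,
    # so shift by the max log-weight for numerical stability.
    w = np.exp(log_w - log_w.max())
    return fit_oracle(X_train, y_train, sample_weight=w)
```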
Together, the ascent and descent steps amount to appending a “Step 3” to each iteration of the generic
two-step MBO algorithm outlined in §2.1, in which the oracle is retrained on re-weighted training
data according to Equation 6. We call this strategy autofocusing the oracle, as it retrains the oracle
in lockstep with the search model, to keep the oracle likelihood maximized on the most promising
regions of the design space. Pseudo-code for autofocusing can be found in the Supplementary
Material (Algorithms 1 and 2). As shown in the experiments, autofocusing tends to improve the
outcomes of design procedures, and when it does not, no harm is incurred relative to the naive
approach with a fixed oracle. Before discussing such experiments, we first make some remarks.
Controlling variance of the importance weights. It is well known that importance weights can
have high, even infinite, variance [26], which may prevent the importance-sampled estimate of the
log-likelihood from being useful for retraining the oracle effectively. That is, solving Equation 6 may not reliably yield oracle parameter estimates that maximize the log-likelihood under the search model. To monitor the reliability of the importance-sampled estimate, one can compute and track an effective sample size of the re-weighted training data, $n_e := (\sum_{i=1}^n w_i)^2 / \sum_{i=1}^n w_i^2$, which reflects the variance of the importance weights [26]. If one has some sense of a suitable sample size
reflects the variance of the importance weights [26]. If one has some sense of a suitable sample size
for the application at hand (e.g., based on the oracle model capacity), then one could monitor ne
and choose not to retrain when it is too small. Another variance control strategy is to use a trust
region to constrain the movement of the search model, such as in [11], which automatically controls
the variance (see Supplementary Material Proposition S2.2). Indeed, our experiments show how
autofocusing works synergistically with a trust-region approach. Finally, two other common strategies
are: (i) self-normalizing the weights, which provides a biased but consistent and lower-variance
estimate [26], and (ii) flattening the weights [24] to $w_i^\alpha$ according to a hyperparameter, α ∈ [0, 1].
The value of α interpolates between the original importance weights (α = 1), which provide an
unbiased but high-variance estimate, and all weights equal to one (α = 0), which is equivalent to
naively training the oracle (i.e., no autofocusing).
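These quantities are straightforward to compute; the following sketch shows the effective sample size and the two weight-variance-control strategies, with illustrative function names.

```python
import numpy as np

def effective_sample_size(w):
    """n_e = (sum_i w_i)^2 / sum_i w_i^2; equals n for uniform weights and
    approaches 1 as a single weight dominates."""
    return w.sum() ** 2 / np.sum(w ** 2)

def flatten_weights(w, alpha=0.2):
    """Flattened weights w_i^alpha: alpha=1 recovers the unbiased, high-variance
    importance weights; alpha=0 gives uniform weights (no autofocusing)."""
    return w ** alpha

def self_normalize(w):
    """Self-normalized weights: a biased but consistent, lower-variance choice."""
    return w / w.sum()
```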
Oracle bias-variance trade-off. If the oracle equals the ground truth over all parts of the design
space encountered during the design procedure, then autofocusing should not improve upon using a
fixed oracle. In practice, however, this is unlikely to ever be the case—the oracle is almost certain
to be misspecified and ultimately mislead the design procedure with incorrect inductive bias. It is
therefore interesting to consider what autofocusing does from the perspective of the bias-variance
trade-off of the oracle, with respect to the search model distribution. On the one hand, autofocusing
retrains the oracle using an unbiased estimate of the log-likelihood over the search model. On the
other hand, as the search model moves further away from the training data, the effective sample size
available to train the oracle decreases; correspondingly, the variance of the oracle increases. In other
words, when we use a fixed oracle (no autofocusing), we prioritize minimal variance at the expense of
greater bias. With pure autofocusing, we prioritize reduction in bias at the expense of higher variance.
Autofocusing with techniques to control the variance of the importance weights [24, 27] enables us
to make a suitable trade-off between these two extremes.
4 Related Work
Although there is no cohesive literature on oracle-based design in the fixed-data setting, its use is
gaining prominence in several application areas, including the design of proteins and nucleotide
sequences [7, 11, 17, 18, 28], molecules [29, 8, 9], and materials [6, 30]. Within such work, the
danger in extrapolating beyond the training distribution is not always acknowledged or addressed. In
fact, proposed design procedures often are validated under the assumption that the oracle is always
correct [8, 14, 17, 18, 29]. Some exceptions include Conditioning by Adaptive Sampling (CbAS) [11],
which employs a probabilistic trust-region approach using a model of the training distribution, and [7],
which uses a hard distance-based threshold. Similar in spirit to [11], Linder et al. regularize the
designed sequences based on their likelihood under a model of the training distribution [31]. In
another approach, a variational autoencoder implicitly enforces a trust region by constraining design
candidates to the probabilistic image of the decoder [9]. Finally, Kumar & Levine tackle design by
learning the inverse of a ground-truth function, which they constrain to agree with an oracle, so as
to discourage too much departure from the training data [28]. None of these approaches update the
oracle in any way. However, autofocusing is entirely complementary to and does not preclude the
additional use of any of these approaches. For example, we demonstrate in our experiments that
autofocusing improves the outcomes of CbAS, which implicitly inhibits the movement of the search
model away from the training distribution.
Related to the design problem is that of active learning in order to optimize a function, using for
example Bayesian optimization [32]. Such approaches are fundamentally distinct from our setting in
that they dynamically acquire new labeled data, thereby more readily allowing for correction of oracle
modeling errors. In a similar spirit, evolutionary algorithms sometimes use a “surrogate” model of the
function of interest (equivalent to our oracle), to help guide the acquisition of new data [33]. In such
settings, the surrogate may be updated using an ad hoc subset of the data [34] or perturbation of the
surrogate parameters [35]. Similarly, a recent reinforcement-learning based approach to biological
sequence design relies on new data to refine the oracle when moving into a region of design space
where the oracle is unreliable [36].
Offline reinforcement learning (RL) [37] shares similar characteristics with our problem in that the
goal is to find a policy that optimizes a reward function, given only a fixed dataset of trajectories
sampled using another policy. In particular, offline model-based RL leverages a learned model
of dynamics that may not be accurate everywhere. Methods in that setting have attempted to
account for the shift away from the training distribution using uncertainty estimation and trust-region
approaches [38, 39, 40]; importance sampling has also been used for off-policy evaluation [41, 42].
As noted in the previous section, autofocusing operates through iterative retraining of the oracle in
order to correct for covariate shift induced by the movement of the search model. It can therefore be
connected to ideas from domain adaptation more broadly [25]. Finally, we note that mathematically,
oracle-based MBD is related to the decision-theoretic framework of performative prediction [43].
Perdomo et al. formalize the phenomenon in which using predictive models to perform actions
induces distributional shift, then present theoretical analysis of repeated retraining with new data as a
solution. Our problem has two major distinctions from this setting: first, the ultimate goal in design
is to maximize an unknowable ground-truth objective, not to minimize risk of the oracle. The latter
is only relevant to the extent that it helps us achieve the former, and our work operationalizes that
connection by formulating and minimizing the oracle gap. Second, we are in a fixed-data setting. Our
work demonstrates the utility of adaptive retraining even in the absence of new data.
5 Experiments
We now demonstrate empirically, across a variety of both experimental settings and MBO algorithms,
how autofocusing can help us better achieve design goals. First, we leverage an intuitive example to gain detailed insights into how autofocusing behaves. We then conduct a systematic study on a more realistic problem of designing superconducting materials. Code for our experiments is available at
https://github.com/clarafy/autofocused_oracles.
5.1 An illustrative example
To investigate how autofocusing works in a setting that can be understood intuitively, we constructed
a one-dimensional design problem where the goal was to maximize a multi-modal ground-truth
function, f(x) : ℝ → ℝ₊, given fixed training data (Figure 1a). The training distribution from which training points were drawn, p0(x), was a Gaussian with variance σ0², centered at x = 3, a point where f(x) is small relative to the global maximum at x = 7. This captures the common scenario where the oracle training data do not extend out to global optima of the property of interest. As we increase the variance of the training distribution, σ0², the training data become more likely to approach the global maximum of f(x). The training labels are drawn from p(y | x) = N(f(x), σε²), where σε² is the variance of the label noise. For this example, we used CbAS [11], an MBO algorithm that employs a probabilistic trust region. We did not control the variance of the importance weights.
An MBO algorithm prescribes a sequence of search models as the optimization proceeds, where
each successive search model is fit using weighted MLE to samples from its predecessor. However,
in our one-dimensional example, one can instead use numerical quadrature to directly compute
each successive search model [11]. Such an approach enables us to abstract out the particular
parametric form of the search model, thereby more directly exposing the effects of autofocusing.
In particular, we used numerical quadrature to compute the search model density at iteration t as
p^(t)(x) ∝ P_{β^(t)}(y ∈ S^(t) | x) p0(x), where S^(t) belongs to a sequence of relaxed constraint sets such that S^(t) ⊇ S^(t+1) ⊇ S [11]. We computed this sequence of search models in two ways: (i) without
autofocusing, that is, with a fixed oracle trained once on equally weighted training data, and (ii)
with autofocusing, that is, where the oracle was retrained at each iteration. In both cases, the oracle
6
Figure 1: Illustrative example. Panels (a-d) show detailed snapshots of the MBO algorithm,
CbAS [11], with and without autofocusing (AF) in each panel. The vertical axis represents both y
values (for the oracle and ground truth) and probability density values (of the training distribution,
p0(x), and search distributions, p_{θ^(t)}(x)). Shaded envelopes correspond to ±1 standard deviation of the oracles, σ_{β^(t)}, with the oracle expectations, µ_{β^(t)}(x), shown as a solid line. Specifically, (a)
at initialization, the oracle and search model are the same for AF and non-AF. Intermediate and
final iterations are shown in (b-d), where the non-AF and AF oracles and search models increasingly
diverge. Greyscale of training points corresponds to their importance weights used for autofocusing.
In (d), each star and dotted horizontal line indicate the ground-truth value corresponding to the point
of maximum density, indicative of the quality of the final search model (higher is better). The values
of (σε, σ0) used here correspond to the ones marked by an × in Figure 2, which summarizes results
across a range of settings. Panels (e,f) show the search model over all iterations without and with
autofocusing, respectively.
Figure 2: Improvement from autofocusing (AF) over a wide range of settings of the illustrative
example. Each colored square shows the improvement (averaged over 50 trials) conferred by AF for
one setting, (σε, σ0), of, respectively, the standard deviations of the label noise and the training distribution. Improvement is quantified as the difference between the ground-truth objective in Equation 1
achieved by the final search model with and without AF. A positive value means AF yielded higher
ground-truth values (i.e., performed better than without AF), while zero means it neither helped nor
hurt. Similar plots to Figure 1 are shown in the Supplementary Material for other settings (Figure S1).
was of the form pβ(y | x) = N(µβ(x), σβ²), where µβ(x) was fit by kernel ridge regression with a radial basis function kernel, and σβ² was set to the mean squared error between µβ(x) and the labels (see Supplementary Material §S3 for more details). Since this was a maximization problem, the desired condition was set as S = {y : y ≥ max_x µβ(x)} (where max_x µβ(x) = 0.68 for the initial oracle).
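A sketch of this oracle fit, using scikit-learn, is shown below; note that here σβ² is computed from in-sample (weighted) errors for brevity, whereas our experiments estimated it with importance-weighted cross-validation (Supplementary Material §S3).

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def fit_oracle_1d(X, y, sample_weight=None):
    """Fit p_beta(y|x) = N(mu_beta(x), sigma_beta^2) for the 1-D example.

    X: (n, 1) design points; y: (n,) labels.
    mu_beta: RBF-kernel ridge regression (sklearn defaults).
    sigma_beta^2: (weighted) mean squared error of mu_beta on the labels;
    in-sample MSE is an illustrative simplification of the CV estimate.
    """
    mu = KernelRidge(kernel="rbf")
    mu.fit(X, y, sample_weight=sample_weight)
    sq_err = (y - mu.predict(X)) ** 2
    sigma2 = np.average(sq_err, weights=sample_weight)  # weights=None -> plain mean
    return mu, sigma2
```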
We found that autofocusing more effectively shifts the search model toward the ground-truth global
maximum as the iterations proceed (Figure 1b-f), thereby providing improved design candidates.
To understand the effect of autofocusing more systematically, we repeated the experiment just described across a range of settings of the variances of the training distribution, σ0², and of the label
Table 1: Designing superconducting materials. We ran six different MBO methods, each with and
without autofocusing. For each method, we extracted those samples with oracle expectations above
the 80th percentile and computed their ground-truth expectations. We report the median and maximum
of those ground-truth expectations (both in degrees K), their percent chance of improvement (PCI, in
percent) over the maximum label in the training data, as well as the Spearman correlation (ρ) and root
mean squared error (RMSE, in degrees K) between the oracle and ground-truth expectations. Each
reported score is averaged over 10 trials, where, in each trial, a different training set was sampled
from the training distribution. “Mean Diff.” is the average difference between the score when using
autofocusing compared to not. Bold values with one star (*) and two stars (**), respectively, mean
p-values < 0.05 and < 0.01 from a two-sided Wilcoxon signed-rank test on the 10 paired score
differences between a method with autofocus and without (’Original’). For all scores but RMSE, a
higher value means autofocusing yielded better results (as indicated by the arrow ↑); for RMSE, the
opposite is true (as indicated by the arrow ↓).
              Median ↑   Max ↑   PCI ↑    ρ ↑     RMSE ↓
CbAS
Original        51.5     103.8    0.11    0.05      17.2
Autofocused     76.4     119.8    3.78    0.56      12.9
Mean Diff.      24.9**    16.0**  3.67**  0.51**    -4.4**

DbAS
Original        57.0      98.4    0.11    0.01      29.6
Autofocused     78.9     111.6    4.4     0.01      24.5
Mean Diff.      21.9**    13.2**  4.2**   0.01      -5.1*

RWR
Original        43.4     102.0    0.05    0.92       7.4
Autofocused     71.4     114.0    1.60    0.65      12.7
Mean Diff.      28.0**    12.0**  1.56** -0.27**     5.4**

FB
Original        49.2     100.6    0.14    0.09      17.5
Autofocused     64.2     111.6    0.86    0.49      11.1
Mean Diff.      15.0**    11.0**  0.73**  0.40**    -6.4**

CEM-PI
Original        34.5      55.8    0.00   -0.16     148.3
Autofocused     67.0      98.0    1.69    0.13      29.4
Mean Diff.      32.6**    42.3*   1.69*   0.29    -118.9**

CMA-ES
Original        42.1      69.4    0.00    0.27   27493.2
Autofocused     50.2      85.8    0.01    0.52    9499.8
Mean Diff.       8.1*     16.3*   0.01    0.25* -17993.5*
noise, σε² (Figure 2). Intuitively, both these variances control how informative the training data are about the ground-truth global maximum: as σ0² increases, the training data are more likely to include points near the global maximum, and as σε² decreases, the training labels are less noisy. Therefore, if the training data are either too uninformative (small σ0² and/or large σε²) or too informative (large σ0² and/or small σε²), then one would not expect autofocusing to substantially improve design. In
intermediate regimes, autofocusing should be particularly useful. Such a phenomenon is seen in our
experiments (Figure 2). Importantly, this kind of intermediate regime is one in which practitioners
are likely to find themselves: the motivation for design is often sparked by the existence of a few
examples with property values that are exceptional compared to most known examples, yet the design
goal is to push the desired property to be more exceptional still. In contrast, if the true global optimum
already resides in the training data, one cannot hope to design anything better anyway. However, even
in regimes where autofocusing does not help, on average it does not hurt relative to a naive approach
with a fixed oracle (Figure 2 and Supplementary Material §S3).
5.2 Designing superconductors with maximal critical temperature
Designing superconducting materials with high critical temperatures is an active research problem
that impacts engineering applications from magnetic resonance imaging systems to the Large Hadron
Collider. To assess autofocusing in a more realistic scenario, we used a dataset comprising 21,263
superconducting materials paired with their critical temperatures [44], the maximum temperature
at which a material exhibits superconductivity. Each material is represented by a feature vector
of length eighty-one, which contains real-valued properties of the material’s constituent elements
(e.g., their atomic radius and valence). We outline our experiments here, with details deferred to the
Supplementary Material §S4.
Unlike in silico validation of a predictive model, one cannot hold out data to validate a design
algorithm because one will not have ground-truth labels for proposed design candidates. Thus,
similarly to [11], we created a “ground-truth” model by training gradient-boosted regression trees [44,
45] on the whole dataset and treating the output as the ground-truth expectation, E[y | x], which can
be called at any time. Next, we generated training data to emulate the common scenario in which
design practitioners have labeled data that are not dense near ground-truth global optima. In particular,
we selected the n = 17,015 training points from the dataset whose ground-truth expectations were in the bottom 80th percentile. We used MLE with these points to fit a full-rank multivariate normal,
which served as the training distribution, p0 (x), from which we drew n simulated training points,
{xi }ni=1 . For each xi we drew one sample, yi ∼ N (E[y | xi ], 1), to obtain a noisy ground-truth label.
Finally, for our oracle, we used {(xi , yi )}ni=1 to train an ensemble of three neural networks that output
both µβ (x) and σβ2 (x), to provide predictions of the form pβ (y | x) = N (µβ (x), σβ2 (x)) [46].
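As a sketch of one such ensemble member, the following PyTorch module outputs a predictive mean and log-variance and is trained with an importance-weighted Gaussian negative log-likelihood; the architecture (a single hidden layer of width 100) is an illustrative assumption, not the exact one used in our experiments.

```python
import torch
import torch.nn as nn

class GaussianRegressor(nn.Module):
    """One member of a deep ensemble [46]: maps a feature vector to the mean
    and log-variance of a Gaussian predictive distribution."""
    def __init__(self, d_in=81, d_hidden=100):  # d_hidden is an assumption
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.mu = nn.Linear(d_hidden, 1)
        self.logvar = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h).squeeze(-1), self.logvar(h).squeeze(-1)

def weighted_gaussian_nll(mu, logvar, y, w):
    """Importance-weighted Gaussian negative log-likelihood (cf. Equation 6),
    up to an additive constant."""
    nll = 0.5 * (logvar + (y - mu) ** 2 / logvar.exp())
    return (w * nll).mean()
```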
We ran six different MBO algorithms, each with and without autofocusing, with the goal of designing
materials with maximal critical temperatures. In all cases, we used a full-rank multivariate normal for the search model, and flattened the importance weights used for autofocusing to $w_i^\alpha$ [24] with α = 0.2 to help control variance. The MBO algorithms were: (i) Conditioning by Adaptive Sampling
(CbAS) [11]; (ii) Design by Adaptive Sampling (DbAS) [11]; (iii) reward-weighted regression (RWR)
[23]; (iv) the “feedback” mechanism proposed in [17] (FB); (v) the cross-entropy method used to
optimize probability of improvement (CEM-PI) [11, 32]; and (vi) Covariance Matrix Adaptation
Evolution Strategy (CMA-ES) [47]. These are briefly described in the Supplementary Material §S4.
To quantify the success of each algorithm, we did the following. At each iteration, t, we first computed
the oracle expectations, E_{β^(t)}[y | x], for each of n samples drawn from the search model, p_{θ^(t)}(x).
We then selected the iteration where the 80th percentile of these oracle expectations was greatest. For
that iteration, we computed various summary statistics on the ground-truth expectations of the best
samples, as judged by the oracle from that iteration (i.e., samples with oracle expectations greater
than the 80th percentile; Table 1). See Algorithm 3 in the Supplementary Material for pseudocode
of this procedure. Our evaluation procedure emulates the typical setting in which a practitioner
has limited experimental resources, and can only evaluate the ground truth for the most promising
candidates [4, 5, 6, 7].
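The following sketch captures the logic of this evaluation procedure (cf. Algorithm 3 in the Supplementary Material); the function signature is illustrative.

```python
import numpy as np

def score_design_run(samples_per_iter, oracle_mu_per_iter, ground_truth_mu, q=80):
    """Pick the iteration whose oracle expectations have the greatest q-th
    percentile, then score the oracle-judged best samples under the ground truth.

    samples_per_iter[t]: samples drawn from the search model at iteration t.
    oracle_mu_per_iter[t]: their oracle expectations E_{beta^(t)}[y | x].
    ground_truth_mu: function returning ground-truth expectations E[y | x].
    """
    percentiles = [np.percentile(mu, q) for mu in oracle_mu_per_iter]
    t = int(np.argmax(percentiles))
    mu_t = oracle_mu_per_iter[t]
    best = mu_t > np.percentile(mu_t, q)   # samples above the q-th percentile
    gt = ground_truth_mu(samples_per_iter[t][best])
    return {"median": float(np.median(gt)), "max": float(np.max(gt))}
```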
Across the majority of evaluation metrics, for all MBO methods, autofocusing a method provided a
statistically significant improvement upon the original method. The percent chances of improvement
(PCI, the percent chances that the best samples had greater ground-truth expectations than the
maximum label in the training data), expose the challenging nature of the design problem. All
methods with no autofocusing had a PCI less than 0.14%, which although small, still reflects a
marked improvement over a naive baseline of simply drawing n new samples from the training
distribution itself, which achieves 5.9 × 10⁻³%. Plots of design trajectories from these experiments,
and results from experiments without variance control and with oracle architectures of higher and
lower capacities, can be found in the Supplementary Material (Figures S3 and S4, Table S2).
6 Discussion
We have introduced a new formulation of oracle-based design as a non-zero-sum game. From this
formulation, we developed a new approach for design wherein the oracle—the predictive model that
replaces costly and time-consuming laboratory experiments—is iteratively retrained so as to “autofocus” it on the current region of design candidates under consideration. Our autofocusing approach
can be applied to any design procedure that uses model-based optimization. We recommend using
autofocusing with an MBO method that uses trust regions, such as CbAS [11], which automatically
helps control the variance of the importance weights used for autofocusing. For autofocusing an
MBO algorithm without a trust region, practical use of the oracle gap certificate and/or effective
sample size should be further investigated. Nevertheless, even without these, we have demonstrated
empirically that autofocusing can provide benefits.
Autofocusing can be seen as dynamically correcting for covariate shift as the design procedure
explores design space. It can also be understood as enabling a design procedure to navigate a trade-off
between the bias and variance of the oracle, with respect to the search model distribution. One
extension of this idea is to also perform oracle model selection at each iteration, such that the model
capacity is tailored to the level of importance weight variance.
Further extensions to consider are alternate strategies for estimating the importance weights [27]. In
particular, training discriminative classifiers to estimate these weights may be fruitful when using
search models that are implicit generative models, or whose likelihood cannot otherwise be computed
in closed form, such as variational autoencoders [27, 48]. We believe this may be a promising
approach for applying autofocusing to biological sequence design and other discrete design problems,
which often leverage such models. One can also imagine extensions of autofocusing to gradient-based
design procedures [18]—for example, using techniques for test-time oracle retraining, in order to
evaluate the current point most accurately [49].
Acknowledgments and Disclosure of Funding
Many thanks to Sebastián Prillo, David Brookes, Chloe Hsu, Hunter Nisonoff, Akosua Busia, and
Sergey Levine for helpful comments on the work. We are also grateful to the U.S. National Science
Foundation Graduate Research Fellowship Program and the Chan Zuckerberg Investigator Program
for funding.
7 Broader Impact
If adopted more broadly, our work could affect how novel proteins, small molecules, materials, and
other entities are engineered. Because predictive models are imperfect, even with the advances presented herein, care should be taken by practitioners to verify that any proposed design candidates are
indeed safe and ethical for the intended downstream applications. The machine learning approach we
present facilitates obtaining promising design candidates in a cost-effective manner, but practitioners
must follow up on candidates proposed by our approach with conventional laboratory methods, as
appropriate to the application domain.
References
[1] F. H. Arnold, “Directed evolution: Bringing new chemistry to life,” Angew. Chem., vol. 57,
pp. 4143–4148, Apr. 2018.
[2] J. Hafner, “Ab-initio simulations of materials using VASP: Density-functional theory and
beyond,” J. Comput. Chem., vol. 29, pp. 2044–2078, Oct. 2008.
[3] K. K. Yang, Z. Wu, and F. H. Arnold, “Machine-learning-guided directed evolution for protein
engineering,” Nat. Methods, vol. 16, pp. 687–694, Aug. 2019.
[4] Z. Wu, S. B. J. Kan, R. D. Lewis, B. J. Wittmann, and F. H. Arnold, “Machine learning-assisted
directed protein evolution with combinatorial libraries,” Proc. Natl. Acad. Sci. U. S. A., vol. 116,
pp. 8852–8858, Apr. 2019.
[5] C. N. Bedbrook, K. K. Yang, J. E. Robinson, E. D. Mackey, V. Gradinaru, and F. H. Arnold, “Ma-
chine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics,”
Nat. Methods, vol. 16, pp. 1176–1184, Nov. 2019.
[6] A. Mansouri Tehrani, A. O. Oliynyk, M. Parry, Z. Rizvi, S. Couper, F. Lin, L. Miyagi, T. D.
Sparks, and J. Brgoch, “Machine learning directed search for ultraincompressible, superhard
materials,” J. Am. Chem. Soc., vol. 140, pp. 9844–9853, Aug. 2018.
[7] S. Biswas, G. Khimulya, E. C. Alley, K. M. Esvelt, and G. M. Church, “Low-N protein
engineering with data-efficient deep learning,” bioRxiv, Jan. 2020.
[8] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning for de novo drug design,”
Sci Adv, vol. 4, p. eaap7885, July 2018.
[9] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling,
D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, “Auto-
matic Chemical Design Using a Data-Driven Continuous Representation of Molecules,” ACS
Central Science, vol. 4, no. 2, pp. 268–276, 2018.
[10] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence
predictions for unrecognizable images,” in Proc. of the Conference on Computer Vision and
Pattern Recognition (CVPR), 2015.
[11] D. H. Brookes, H. Park, and J. Listgarten, “Conditioning by adaptive sampling for robust design,”
in Proc. of the International Conference on Machine Learning (ICML), 2019.
[12] M. Zlochin, M. Birattari, N. Meuleau, and M. Dorigo, “Model-Based search for combinatorial
optimization: A critical survey,” Ann. Oper. Res., vol. 131, pp. 373–395, Oct. 2004.
[13] D. H. Brookes, A. Busia, C. Fannjiang, K. Murphy, and J. Listgarten, “A view of Estimation of
Distribution Algorithms through the lens of Expectation-Maximization,” in Proc. of the Genetic
and Evolutionary Computation Conference (GECCO), 2020.
[14] J. Linder and G. Seelig, “Fast differentiable DNA and protein sequence optimization for
molecular design,” arXiv, May 2020.
[15] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proc. of the International
Conference on Learning Representations (ICLR), 2014.
[16] D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf, A. Pagnani, R. Zecchina, and C. Sander,
“Protein 3d structure computed from evolutionary sequence variation,” PLOS One, vol. 6,
pp. 1–20, 12 2011.
[17] A. Gupta and J. Zou, “Feedback GAN for DNA optimizes protein functions,” Nature Machine
Intelligence, vol. 1, pp. 105–111, Feb. 2019.
[18] N. Killoran, L. J. Lee, A. Delong, D. Duvenaud, and B. J. Frey, “Generating and designing
DNA with deep generative models,” in Neural Information Processing Systems (NeurIPS)
Computational Biology Workshop, 2017.
[19] E. Bengoetxea, P. Larrañaga, I. Bloch, and A. Perchant, “Estimation of distribution algorithms: A
new evolutionary computation approach for graph matching problems,” in Energy Minimization
Methods in Computer Vision and Pattern Recognition (M. Figueiredo, J. Zerubia, and A. K.
Jain, eds.), pp. 454–469, Springer Berlin Heidelberg, 2001.
[20] S. Baluja and R. Caruana, “Removing the genetics from the standard genetic algorithm,” in
Machine Learning Proceedings 1995 (A. Prieditis and S. Russell, eds.), pp. 38–46, San Francisco
(CA): Morgan Kaufmann, 1995.
[21] R. Y. Rubinstein, “Optimization of computer simulation models with rare events,” European
Journal of Operational Research, vol. 99, no. 1, pp. 89–112, 1997.
[22] R. Rubinstein, “The Cross-Entropy Method for Combinatorial and Continuous Optimization,”
Methodology And Computing In Applied Probability, vol. 1, no. 2, pp. 127–190, 1999.
[23] J. Peters and S. Schaal, “Reinforcement learning by reward-weighted regression for operational
space control,” in Proc. of the 24th International Conference on Machine Learning (ICML),
2007.
[24] M. Sugiyama, M. Krauledat, and K.-R. Müller, “Covariate shift adaptation by importance
weighted cross validation,” J. Mach. Learn. Res., vol. 8, pp. 985–1005, May 2007.
[25] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence, eds., Dataset Shift in
Machine Learning. Neural Information Processing Series, MIT Press, 2008.
[26] A. B. Owen, Monte Carlo Theory, Methods and Examples. 2013.
[27] M. Sugiyama, T. Suzuki, and T. Kanamori, Density Ratio Estimation in Machine Learning.
Cambridge University Press, Feb. 2012.
[28] A. Kumar and S. Levine, “Model inversion networks for model-based optimization,” arXiv, Dec.
2019.
[29] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular de-novo design through
deep reinforcement learning,” J. Cheminform., vol. 9, p. 48, Sept. 2017.
[30] G. Hautier, C. C. Fischer, A. Jain, T. Mueller, and G. Ceder, “Finding nature’s missing ternary
oxide compounds using machine learning and density functional theory,” Chem. Mater., vol. 22,
pp. 3762–3767, June 2010.
[31] J. Linder, N. Bogard, A. B. Rosenberg, and G. Seelig, “A generative neural network for
maximizing fitness and diversity of synthetic DNA and protein sequences,” Cell Syst, vol. 11,
pp. 49–62.e16, July 2020.
[32] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning
algorithms,” in Advances in Neural Information Processing Systems (NeurIPS), 2012.
[33] Y. Jin, “Surrogate-assisted evolutionary computation: Recent advances and future challenges,”
Swarm and Evolutionary Computation, vol. 1, pp. 61–70, June 2011.
[34] M. N. Le, Y. S. Ong, S. Menzel, Y. Jin, and B. Sendhoff, “Evolution by adapting surrogates,”
Evol. Comput., vol. 21, no. 2, pp. 313–340, 2013.
[35] M. D. Schmidt and H. Lipson, “Coevolution of fitness predictors,” IEEE Trans. Evol. Comput.,
vol. 12, pp. 736–749, Dec. 2008.
[36] C. Angermueller, D. Dohan, D. Belanger, R. Deshpande, K. Murphy, and L. Colwell, “Model-
based reinforcement learning for biological sequence design,” in Proc. of the International
Conference on Learning Representations (ICLR), 2019.
[37] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review,
and perspectives on open problems,” arXiv, May 2020.
[38] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful
of trials using probabilistic dynamics models,” in Advances in Neural Information Processing
Systems (NeurIPS), 2018.
[39] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to
policy search,” in Proc. of the 28th International Conference on Machine Learning (ICML),
2011.
[40] N. Rhinehart, R. McAllister, and S. Levine, “Deep imitative models for flexible inference,
planning, and control,” in Proc. of the International Conference on Learning Representations
(ICLR), 2020.
[41] D. Precup, R. S. Sutton, and S. Dasgupta, “Off-policy temporal-difference learning with function
approximation,” in Proc. of the International Conference on Machine Learning (ICML), 2001.
[42] P. S. Thomas and E. Brunskill, “Data-Efficient Off-Policy policy evaluation for reinforcement
learning,” in Proc. of the International Conference on Machine Learning (ICML), 2016.
[43] J. C. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt, “Performative prediction,” in Proc.
of the International Conference on Machine Learning (ICML), 2020.
[44] K. Hamidieh, “A data-driven statistical model for predicting the critical temperature of a
superconductor,” Comput. Mater. Sci., vol. 154, pp. 346–354, Nov. 2018.
[45] V. Stanev, C. Oses, A. G. Kusne, E. Rodriguez, J. Paglione, S. Curtarolo, and I. Takeuchi, “Ma-
chine learning modeling of superconducting critical temperature,” npj Computational Materials,
vol. 4, p. 29, June 2018.
[46] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty
estimation using deep ensembles,” in Advances in Neural Information Processing Systems
(NeurIPS), 2017.
[47] N. Hansen, “The CMA evolution strategy: A comparing review,” in Towards a New Evolutionary
Computation: Advances in the Estimation of Distribution Algorithms (J. A. Lozano, P. Larrañaga,
I. Inza, and E. Bengoetxea, eds.), pp. 75–102, Springer Berlin Heidelberg, 2006.
[48] A. Grover, J. Song, A. Kapoor, K. Tran, A. Agarwal, E. J. Horvitz, and S. Ermon, “Bias
correction of learned generative models using Likelihood-Free importance weighting,” in
Advances in Neural Information Processing Systems (NeurIPS), 2019.
[49] Y. Sun, X. Wang, Z. Liu, J. Miller, A. A. Efros, and M. Hardt, “Test-Time training for Out-of-
Distribution generalization,” arXiv, Sept. 2019.
[50] A. M. Metelli, M. Papini, F. Faccio, and M. Restelli, “Policy optimization via importance
sampling,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
[51] T. van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE
Trans. Inf. Theory, vol. 60, pp. 3797–3820, July 2014.
[52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of the
International Conference on Learning Representations (ICLR), 2015.
Supplementary Material
S1 Pseudocode
Algorithm 1 gives pseudocode for autofocusing a broad class of model-based optimization (MBO)
algorithms known as estimation of distribution algorithms (EDAs), which can be seen as performing
Monte-Carlo expectation-maximization [13]. EDAs proceed at each iteration with a sampling-based
“E-step” (Steps 1 and 2 in Algorithm 1) and a weighted maximum likelihood estimation (MLE)
“M-step” (Step 3; see [13] for more details). Different EDAs are distinguished by method-specific
monotonic transformations V (·), which determine the sample weights used in the M-step. In some
EDAs, this transformation is not explicitly defined, but rather implicitly implemented by constructing
and using a sequence of relaxed constraint sets, S (t) , such that S (t) ⊇ S (t+1) ⊇ S [21, 22, 11].
Algorithm 2 gives pseudocode for autofocusing a particular EDA, Conditioning by Adaptive Sampling
(CbAS) [11], which uses such a sequence of relaxed constraint sets, as well as M-step weights that
induce an implicit trust region for the search model update. For simplicity, the algorithm is instantiated
with the design goal of maximizing the property of interest. It can easily be generalized to the goal
of achieving a specific value for the property, or handling multiple properties (see Sections S2-3 of
[11]).
Use of [·] in the pseudocode denotes an optional input argument with default values.
Algorithm 2: Autofocused Conditioning by Adaptive Sampling (CbAS)

Input: Training data, {(xi, yi)}_{i=1}^n; oracle model class, pβ(y | x), with parameters, β, that can be estimated with MLE; search model class, pθ(x), with parameters, θ, that can be estimated with weighted MLE or approximations thereof; maximum number of iterations, T; number of samples to generate, m; [percentile threshold, Q = 90].

Initialization: Obtain p0(x) by fitting to {xi}_{i=1}^n with the search model class. For the search model, set p_{θ^(0)}(x) ← p0(x). For the oracle, p_{β^(0)}(y | x), use MLE with equally weighted training data. Set γ0 = −∞.

for t = 1, . . . , T do
  1. Sample from the current search model, x̃i^(t) ∼ p_{θ^(t−1)}(x), ∀i ∈ {1, . . . , m}.
  2. qt ← Q-th percentile of the oracle expectations of the samples, {µ_{β^(t−1)}(x̃i^(t))}_{i=1}^m.
  3. γt ← max{γ_{t−1}, qt}.
  4. vi^(t) ← (p0(x̃i^(t)) / p_{θ^(t−1)}(x̃i^(t))) · P_{β^(t−1)}(y ≥ γt | x̃i^(t)), ∀i ∈ {1, . . . , m}.
  5. Fit the updated search model, p_{θ^(t)}(x), using weighted MLE with the samples, {x̃i^(t)}_{i=1}^m, and their corresponding EDA weights, {vi^(t)}_{i=1}^m.
  6. Compute importance weights for the training data, wi ← p_{θ^(t)}(xi) / p_{θ^(0)}(xi), for i = 1, . . . , n.
  7. Retrain the oracle using the re-weighted training data,
     β^(t) ← arg max_{β∈B} (1/n) Σ_{i=1}^n wi log pβ(yi | xi).
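For reference, a compact sketch of one iteration of Algorithm 2 in Python follows; the object interfaces (.sample, .logpdf, .fit, .mu, .sigma, .refit) are illustrative assumptions rather than the released implementation.

```python
import numpy as np
from scipy.stats import norm

def autofocused_cbas_iteration(search, p0, oracle, X, y, gamma_prev, m=1000, Q=90):
    """One pass through Steps 1-7 of Algorithm 2, assuming a Gaussian oracle
    so that P_beta(y >= gamma | x) is a Gaussian tail probability."""
    x_tilde = search.sample(m)                                   # Step 1
    mu = oracle.mu(x_tilde)
    gamma = max(gamma_prev, np.percentile(mu, Q))                # Steps 2-3
    v = np.exp(p0.logpdf(x_tilde) - search.logpdf(x_tilde)) \
        * norm.sf(gamma, loc=mu, scale=oracle.sigma(x_tilde))    # Step 4
    search.fit(x_tilde, weights=v)                               # Step 5
    w = np.exp(search.logpdf(X) - p0.logpdf(X))                  # Step 6
    oracle.refit(X, y, sample_weight=w)                          # Step 7
    return gamma
```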
S2 Proofs, derivations, and supplementary results
Proof of Proposition 1. For any distribution pθ(x), if
$$\mathbb{E}_{p_\theta(x)}\left[D_{\mathrm{KL}}(p(y \mid x) \,\|\, p_\beta(y \mid x))\right] \leq \epsilon, \qquad (7)$$
then it holds that
$$\mathbb{E}_{p_\theta(x)}\left[\left|P(y \in S \mid x) - P_\beta(y \in S \mid x)\right|^2\right] \leq \mathbb{E}_{p_\theta(x)}\left[\delta(p(y \mid x), p_\beta(y \mid x))^2\right] \qquad (8)$$
$$\leq \frac{1}{2}\, \mathbb{E}_{p_\theta(x)}\left[D_{\mathrm{KL}}(p(y \mid x) \,\|\, p_\beta(y \mid x))\right] \qquad (9)$$
$$\leq \frac{\epsilon}{2}, \qquad (10)$$
where δ(p, q) is the total variation distance between probability distributions p and q, and the second inequality is due to Pinsker's inequality. Finally, applying Jensen's inequality yields
$$\mathbb{E}_{p_\theta(x)}\left[\left|P(y \in S \mid x) - P_\beta(y \in S \mid x)\right|\right] \leq \sqrt{\epsilon/2}. \qquad (11)$$
Here, we derive the descent step of the alternating ascent-descent algorithm described in §3.2. At iteration t, given the search model parameters, θ^(t), our goal is to update the oracle parameters according to
$$\beta^{(t)} = \operatorname*{arg\,min}_{\beta \in \mathcal{B}} \mathbb{E}_{p_{\theta^{(t)}}(x)}\left[D_{\mathrm{KL}}(p(y \mid x) \,\|\, p_\beta(y \mid x))\right]. \qquad (12)$$
Note that
$$\beta^{(t)} = \operatorname*{arg\,min}_{\beta \in \mathcal{B}} \mathbb{E}_{p_{\theta^{(t)}}(x)}\left[\int_{\mathbb{R}} p(y \mid x) \log p(y \mid x)\, dy - \int_{\mathbb{R}} p(y \mid x) \log p_\beta(y \mid x)\, dy\right] \qquad (13)$$
$$= \operatorname*{arg\,max}_{\beta \in \mathcal{B}} \mathbb{E}_{p_{\theta^{(t)}}(x)}\left[\int_{\mathbb{R}} p(y \mid x) \log p_\beta(y \mid x)\, dy\right] \qquad (14)$$
$$= \operatorname*{arg\,max}_{\beta \in \mathcal{B}} \mathbb{E}_{p_{\theta^{(t)}}(x)} \mathbb{E}_{p(y \mid x)}\left[\log p_\beta(y \mid x)\right]. \qquad (15)$$
We cannot query the ground truth, p(y | x), but we do have labeled training data, {(xi, yi)}_{i=1}^n, where xi ∼ p0(x) and yi ∼ p(y | x = xi) by definition. We therefore leverage importance sampling, using p0(x) as the proposal distribution, to obtain
$$\beta^{(t)} = \operatorname*{arg\,max}_{\beta \in \mathcal{B}} \mathbb{E}_{p_0(x)} \mathbb{E}_{p(y \mid x)}\left[\frac{p_{\theta^{(t)}}(x)}{p_0(x)} \log p_\beta(y \mid x)\right]. \qquad (16)$$
Finally, we instantiate an importance sampling estimate of the objective in Equation 16 with our labeled training data, to get a practical oracle parameter update,
$$\beta^{(t)} = \operatorname*{arg\,max}_{\beta \in \mathcal{B}} \frac{1}{n} \sum_{i=1}^{n} \frac{p_{\theta^{(t)}}(x_i)}{p_0(x_i)} \log p_\beta(y_i \mid x_i). \qquad (17)$$
This update is equivalent to fitting the oracle parameters, β^(t), by performing weighted MLE with the labeled training data, {(xi, yi)}_{i=1}^n, and corresponding weights, {wi}_{i=1}^n, where wi := p_{θ^(t)}(xi)/p0(xi).
The importance-sampled estimate of the log-likelihood used to retrain the oracle (Equation 17) is
unbiased, but may have high variance due to the variance of the importance weights. To assess the
reliability of the importance-sampled estimate, alongside the effective sample size described in §3.3,
one can also monitor confidence intervals on some loss of interest. Let Lβ : X × R → R denote
a pertinent loss function induced by the oracle parameters, β (e.g., the squared error Lβ(x, y) = (Eβ[y | x] − y)²). The following observation is due to Chebyshev’s inequality.
Proposition S2.1. Suppose that Lβ : X × ℝ → ℝ is a bounded loss function, such that |Lβ(x, y)| ≤ L for all x, y, and that pθ ≪ p0. Let {(xi, yi)}_{i=1}^n be labeled training data such that the xi ∼ p0(x) are drawn independently and yi ∼ p(y | x = xi) for each i. For any δ ∈ (0, 1] and any n > 0, with probability at least 1 − δ it holds that
$$\left|\mathbb{E}_{p_\theta(x)} \mathbb{E}_{p(y \mid x)}\left[L_\beta(x, y)\right] - \frac{1}{n} \sum_{i=1}^{n} \frac{p_\theta(x_i)}{p_0(x_i)} L_\beta(x_i, y_i)\right| \leq L \sqrt{\frac{d_2(p_\theta \,\|\, p_0)}{n\delta}}, \qquad (18)$$
where d₂ is the exponentiated Rényi-2 divergence, i.e., $d_2(p_\theta \,\|\, p_0) = \mathbb{E}_{p_0(x)}[(p_\theta(x)/p_0(x))^2]$.
Proof. We use the following lemma to bound the variance of the importance sampling estimate of the loss. Chebyshev’s inequality then yields the desired result.
Lemma S2.1 (Adaptation of Lemma 4.1 in Metelli et al. (2018) [50]). Under the same assumptions as Proposition S2.1, the joint distribution pθ(x)p(y | x) is absolutely continuous with respect to the joint distribution p0(x)p(y | x). Then for any n > 0, it holds that
$$\mathrm{Var}_{p_0(x)p(y \mid x)}\left[\frac{1}{n} \sum_{i=1}^{n} \frac{p_\theta(x_i)}{p_0(x_i)} L_\beta(x_i, y_i)\right] \leq \frac{1}{n} L^2\, d_2(p_\theta \,\|\, p_0). \qquad (19)$$
One can use Proposition S2.1 to construct a confidence interval on, for example, the expected squared
error between the oracle and the ground-truth values with respect to pθ (x), using the labeled training
data on hand. The Rényi divergence can be estimated using, for example, the plug-in estimate $(1/n)\sum_{i=1}^n (p_\theta(x_i)/p_0(x_i))^2$. While the bound, L, on Lβ may be restrictive in general, for any
given application one may be able to use domain-specific knowledge to estimate L. For example,
in designing superconducting materials with maximized critical temperature, one can use an oracle
architecture whose outputs are non-negative and at most some plausible maximum value M (in
degrees Kelvin) according to superconductivity theory; one could then take L = M 2 for squared
error loss. Computing a confidence interval at each iteration of a design procedure then allows one to
monitor the error of the retrained oracle.
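A sketch of this confidence bound, computed with the plug-in estimate of the Rényi divergence, is given below; the function name and default δ are illustrative.

```python
import numpy as np

def loss_bound(w, L, delta=0.05):
    """High-probability half-width from Proposition S2.1, with the Renyi-2
    term replaced by its plug-in estimate (1/n) sum_i w_i^2.

    w: importance weights p_theta(x_i)/p_0(x_i) at the n training points.
    L: bound on |L_beta(x, y)|; delta: failure probability.
    """
    n = len(w)
    d2_hat = np.mean(w ** 2)
    return L * np.sqrt(d2_hat / (n * delta))
```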
Monitoring such confidence intervals, or the effective sample size, is most likely to be useful for
design procedures that do not have in-built mechanisms for restricting the movement of the search
distribution away from the training distribution. Various algorithmic interventions are possible—one
could simply terminate the procedure if the error bounds, or effective sample size, surpass some
threshold, or one could decide not to retrain the oracle for that iteration. For simplicity and clarity
of exposition, we did not use any such interventions in this paper, but we mention them as potential
avenues for further improving autofocusing in practice. Note that 1) the bound in Proposition S2.1 is
only useful if the importance weight variance is finite, and 2) estimating the bound itself requires use
of the importance weights, and thus may also be susceptible to high variance. It may therefore be
advantageous to use a liberal criterion for any interventions.
CbAS naturally controls the importance weight variance. Design procedures that leverage a trust region can bound the variance of the importance weights. For instance, CbAS [11], developed in the context of an oracle with fixed parameters, β, proposes estimating the training distribution conditioned on S as the search model:
$$p_\theta(x) = p_0(x \mid S) = \frac{P_\beta(S \mid x)\, p_0(x)}{P_0(S)}, \qquad (20)$$
where $P_0(S) = \int_{\mathcal{X}} P_\beta(S \mid x)\, p_0(x)\, dx$. This prescribed search model yields the following importance weight variance.
Proposition S2.2. For pθ(x) = p0(x | S), it holds that
$$\mathrm{Var}_{p_0(x)}\!\left[\frac{p_\theta(x)}{p_0(x)}\right] = \frac{1}{P_0(S)} - 1. \qquad (21)$$
That is, so long as S has non-negligible mass under data drawn from the training distribution, p0(x), we
have reasonable control on the variance of the importance weights. Of course, if P0 (S) is too small,
this bound is not useful, but to have any hope of successful data-driven design it is only reasonable to
expect this quantity to be non-negligible. This is similar to the experimental requirement, in directed
evolution for protein design, that the initial data exhibit some “minimal functionality” with regards to
the property of interest [3].
Proof. The variance of the importance weights can be written as
$$\mathrm{Var}_{p_0(x)}\!\left[\frac{p_0(x \mid S)}{p_0(x)}\right] = d_2(p_0(x \mid S) \,\|\, p_0(x)) - 1, \qquad (22)$$
where $d_2(p_0(x \mid S) \,\|\, p_0(x)) = \mathbb{E}_{p_0(x)}[(p_0(x \mid S)/p_0(x))^2]$ is the exponentiated Rényi-2 divergence. Then we have
$$\mathrm{Var}_{p_0(x)}\!\left[\frac{p_\theta(x)}{p_0(x)}\right] = d_2(p_0(x \mid S) \,\|\, p_0(x)) - 1 = \frac{1}{P_0(S)} - 1, \qquad (23)$$
where the second equality is due to the property in Example 1 of [51].
This variance yields the following expression for the population version of the effective sample size:
$$n_e^* := \frac{n\, \mathbb{E}_{p_0(x)}[p_\theta(x)/p_0(x)]^2}{\mathbb{E}_{p_0(x)}[(p_\theta(x)/p_0(x))^2]} = \frac{n}{\mathbb{E}_{p_0(x)}[(p_\theta(x)/p_0(x))^2]} = n\, P_0(S). \qquad (24)$$
Furthermore, CbAS proposes an iterative procedure to estimate $p_\theta(x)$. At iteration $t$, the algorithm
seeks a variational approximation to $p^{(t)}(x) \propto P_\beta(S^{(t)} \mid x)\, p_0(x)$, where $S^{(t)} \supseteq S$. Since
$P_\beta(S^{(t)} \mid x) \geq P_\beta(S \mid x)$, the expressions above for the importance weight variance and effective
sample size of the final search model prescribed by CbAS translate into upper and lower bounds,
respectively, on the importance weight variance and effective sample size of the distributions $p^{(t)}(x)$
prescribed at each iteration.
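Proposition S2.2 and Equation 24 are easy to verify empirically. The following toy Monte Carlo check (our own illustration; a one-dimensional standard normal stands in for $p_0(x)$, and the event $x > c$ plays the role of the constraint set) recovers both the variance and the effective sample size:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
c = 1.0
P0_S = 1.0 - norm.cdf(c)              # mass of the conditioning event under p0
x = rng.standard_normal(1_000_000)    # draws from p0
w = (x > c) / P0_S                    # importance weights p_theta(x) / p0(x)
print(w.var(), 1.0 / P0_S - 1.0)      # Equation 21: both are ~5.3 for c = 1
ess = len(x) * w.mean() ** 2 / (w ** 2).mean()
print(ess, len(x) * P0_S)             # Equation 24: ESS is ~ n * P0(S)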
S3 An illustrative example
S3.1 Experimental details
Ground truth and oracle. For the ground-truth function $f : \mathbb{R} \to \mathbb{R}_+$, we used the sum of the
densities of two Gaussian distributions, $\mathcal{N}(5, 1)$ and $\mathcal{N}(7, 0.25)$. For the expectation of the oracle
model, $\mu_\beta(x) := \mathbb{E}_\beta[y \mid x]$, we used kernel ridge regression with a radial basis function kernel as
implemented in scikit-learn, with the default values for all hyperparameters. The variance of the
oracle model, $\sigma_\beta^2 := \mathrm{Var}_\beta[y \mid x]$, was set to the mean squared error between $\mu_\beta(x)$ and the training
data labels, as estimated with 4-fold importance-weighted cross-validation when autofocusing [24].
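A minimal sketch of this oracle follows, assuming scikit-learn; the fold logic and the weighted-MSE variance estimate are simplified stand-ins for the importance-weighted cross-validation of [24], and fit_oracle is a hypothetical helper name:

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

def fit_oracle(x, y, weights=None):
    # Fit the oracle mean with RBF kernel ridge regression (default
    # hyperparameters), optionally importance-weighted for autofocusing.
    model = KernelRidge(kernel="rbf").fit(x[:, None], y, sample_weight=weights)
    # Set the oracle variance to a (weighted) 4-fold CV estimate of the MSE.
    errs, w_all = [], []
    for tr, te in KFold(n_splits=4, shuffle=True, random_state=0).split(x):
        w_tr = None if weights is None else weights[tr]
        m = KernelRidge(kernel="rbf").fit(x[tr, None], y[tr], sample_weight=w_tr)
        errs.append((y[te] - m.predict(x[te, None])) ** 2)
        w_all.append(np.ones(len(te)) if weights is None else weights[te])
    sigma2 = np.average(np.concatenate(errs), weights=np.concatenate(w_all))
    return (lambda q: model.predict(np.asarray(q)[:, None])), sigma2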
MBO algorithm. We used CbAS as follows. At iteration $t = 1, \ldots, 100$, similar to [11], we used
the relaxed constraint set $S^{(t)} = \{y : y \geq \gamma_t\}$, where $\gamma_t$ was the $t$th percentile of the oracle expectation,
$\mu_\beta(x)$, when evaluated over $x \in [0, 10]$. At the final iteration, $t = 100$, the constraint set is equivalent
to the design goal of maximizing the oracle expectation, $S^{(100)} = S = \{y : y \geq \max_x \mu_\beta(x)\}$,
which is the oracle-based proxy to maximizing the ground-truth function, $f(x)$. At each iteration, we
used numerical quadrature (scipy.integrate.quad) to compute the search model,
$$p^{(t)}(x) = \frac{P_{\beta^{(t)}}(y \in S^{(t)} \mid x)\, p_0(x)}{\int_{\mathcal{X}} P_{\beta^{(t)}}(y \in S^{(t)} \mid x)\, p_0(x)\, dx}. \qquad (25)$$
Numerical integration enabled us to use CbAS without a parametric search model, which otherwise
would have been used to find a variational approximation to this distribution [11]. We also used
numerical integration to compute the value of the model-based design objective (Equation 1) achieved
by the final search model, both with and without autofocusing.
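A sketch of the quadrature computation of Equation 25 is below, assuming a Gaussian oracle with mean mu(x) and standard deviation sigma; the callables, bounds, and example arguments are illustrative placeholders:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def search_density(mu, sigma, p0, gamma_t, lo=0.0, hi=10.0):
    # Unnormalized p_t(x) = P(y >= gamma_t | x) * p0(x) for a Gaussian oracle.
    def unnorm(x):
        return norm.sf(gamma_t, loc=mu(x), scale=sigma) * p0(x)
    Z, _ = quad(unnorm, lo, hi)  # normalizing constant by quadrature
    return lambda x: unnorm(x) / Z

# Example usage, with arbitrary stand-ins for the oracle mean and p0:
p_t = search_density(mu=np.sin, sigma=0.5,
                     p0=lambda x: norm.pdf(x, loc=5.0, scale=1.5), gamma_t=0.8)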
For all tested settings of the variance of the training distribution, $\sigma_0^2$, and the variance of the label
noise, $\sigma_\epsilon^2$, autofocusing yielded positive improvement to the model-based design objective (Equation
1) on average over 50 trials (Figure 2). For a more comprehensive understanding of the effects of
autofocusing, here we pinpoint specific trials where autofocusing decreased the objective, compared
to a naive approach with a fixed oracle. Such trials were rare, and occurred in regimes where one
would not reasonably expect autofocusing to provide a benefit. In particular, as discussed in §5.1,
such regimes include when $\sigma_0^2$ is too small, such that training data are unlikely to be close to the
global maximum, and when $\sigma_0^2$ is too large, such that the training data already include points around
the global maximum and a fixed oracle should be suitable for successful design. Similarly, when the
label noise variance, $\sigma_\epsilon^2$, is too large, the training data are no longer informative and no procedure
should hope to perform well systematically. We now walk through each of these regimes.
When $\sigma_0^2$ was small and there was no label noise, we observed a few trials where the final search
model placed less mass under the global maximum with autofocusing than without. This effect
was due to increased standard deviation of the autofocused oracle, induced by high variance of the
importance weights (Figure S1a). When $\sigma_0^2$ was small and $\sigma_\epsilon^2$ was extremely large, a few trials
yielded lower final objectives with autofocusing by insignificant margins; in such cases, the label
noise was overwhelming enough that the search model did not move much anyway, either with or
without autofocusing (Figure S1b). Similarly, when $\sigma_0^2$ was large and there was no label noise, a few
trials yielded lower final objectives with autofocusing than without, by insignificant margins (Figure
S1c).
Interestingly, when the variances of both the training distribution and label noise were high, autofo-
cusing yielded positive improvement for all trials. In this regime, by encouraging the oracle to fit
most accurately to the points with the highest labels, autofocusing resulted in search models with
greater mass under the global maximum than the fixed-oracle approach, which was more influenced
by the extreme label noise (Figure S1d).
As discussed in §5.1, in practice it is often the case that 1) practitioners can collect reasonably
informative training data for the application of interest, such that some exceptional examples are
measured (corresponding to sufficiently large $\sigma_0^2$), and 2) there is always label noise, due to
measurement error (corresponding to non-zero $\sigma_\epsilon^2$). Thus, we expect many design applications in
practice to fall in the intermediate regime where autofocusing tends to yield positive improvements
over a fixed-oracle approach (Figure 2, Table 1).
Pre-processing. Each of the 21,263 materials in the superconductivity data from [44] is represented
by a vector of eighty-one real-valued features. We zero-centered and normalized each feature to have
unit variance.
Ground-truth model. To construct the model of the ground-truth expectation, E[y | x], we fit
gradient-boosted regression trees using xgboost and the same hyperparameters reported in [44],
which selected them using grid search. The one exception was that we used 200 trees instead of 750
trees, which yielded a hold-out root mean squared error (RMSE) of 9.51 compared to the hold-out
RMSE of 9.5 reported in [44]. To remove collinear features noted in [44], we also performed feature
selection by thresholding xgboost's in-built feature weights, which quantify how many times a
feature is used to split the data across all trees. We kept the sixty most important features according
to this score, which decreased the hold-out RMSE from 9.51 when using all the features to 9.45. The
resulting input space for design was then X = R60 .
Training distribution. To construct the training distribution, we selected the 17,015 points from
the dataset whose ground-truth expectations were below the 80th percentile (equivalent to 73.8 degrees
Kelvin, compared to the maximum of 138.3 degrees Kelvin in the full dataset). We used MLE with
these points to fit a full-rank multivariate normal, which served as the training distribution, $p_0(x)$,
from which we drew $n = 17{,}015$ simulated training points, $\{x_i\}_{i=1}^n$, for each trial. For each $x_i$
we drew one sample, $y_i \sim \mathcal{N}(\mathbb{E}[y \mid x_i], 1)$, to obtain a noisy ground-truth label. This training
distribution produced simulated training points with a distribution of ground-truth expectations,
$\mathbb{E}[y \mid x]$, reasonably comparable to that of the points from the original dataset (Figure S2, left panel).
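A sketch of this construction follows, with ground_truth standing in (as an assumption) for the fitted xgboost model and make_training_data a hypothetical helper name:

import numpy as np

def make_training_data(X_low, ground_truth, n=17_015, seed=None):
    # MLE fit of a full-rank multivariate Gaussian to the selected points.
    rng = np.random.default_rng(seed)
    mean = X_low.mean(axis=0)
    cov = np.cov(X_low, rowvar=False, bias=True)  # bias=True gives the MLE
    x = rng.multivariate_normal(mean, cov, size=n)
    # Noisy ground-truth labels: y_i ~ N(E[y | x_i], 1).
    y = ground_truth(x) + rng.standard_normal(n)
    return x, y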
Oracle. For the oracle, we trained an ensemble of three neural networks to maximize log-likelihood
according to the method described in [46] (without adversarial examples). Each model in the
ensemble had the architecture Input(60) → Dense(100) → Dense(100) → Dense(100) →
Dense(100) → Dense(10) → Dense(2), with elu nonlinearities everywhere except for linear
output units. Out of the range of hidden layer counts and sizes we tested, this architecture minimized
RMSE on held-out data. Each model was trained using Adam [52] with a learning rate of $5 \times 10^{-4}$
for a maximum of 2000 epochs, with a batch size of 64 and early stopping based on the log-likelihood
of a validation set. Across the 10 trials, the initial oracles had hold-out RMSEs between 6.95 and
7.40 degrees Kelvin (Figure S2, right panel).
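For concreteness, one ensemble member might be built as follows in Keras; the softplus parameterization of the predictive variance is a common stabilizing choice that we assume here for illustration, not necessarily the exact parameterization in our code:

import tensorflow as tf

def make_member(d_in=60):
    # ELU hidden layers of sizes (100, 100, 100, 100, 10), then a linear
    # 2-unit output holding the predictive mean and a raw variance.
    inputs = tf.keras.Input(shape=(d_in,))
    h = inputs
    for width in (100, 100, 100, 100, 10):
        h = tf.keras.layers.Dense(width, activation="elu")(h)
    outputs = tf.keras.layers.Dense(2)(h)
    return tf.keras.Model(inputs, outputs)

def gaussian_nll(y_true, out):
    # Negative Gaussian log-likelihood, as in the deep-ensembles method [46].
    mean, raw_var = out[:, :1], out[:, 1:]
    var = tf.math.softplus(raw_var) + 1e-6  # keep the variance positive
    return 0.5 * (tf.math.log(var) + tf.square(y_true - mean) / var)

model = make_member()
model.compile(optimizer=tf.keras.optimizers.Adam(5e-4), loss=gaussian_nll)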
(a) Example trial with low-variance training distribution and no label noise, $(\sigma_0, \sigma_\epsilon) = (1.6, 0)$.
(b) Example trial with low-variance training distribution and high label noise, $(\sigma_0, \sigma_\epsilon) = (1.6, 0.38)$.
(c) Example trial with high-variance training distribution and no label noise, $(\sigma_0, \sigma_\epsilon) = (2.2, 0)$.
(d) Example trial with high-variance training distribution and high label noise, $(\sigma_0, \sigma_\epsilon) = (2.2, 0.38)$.
Figure S1: Examples of regimes where autofocus (AF) sometimes yielded lower final objectives
than without (non-AF). Each row shows snapshots of CbAS in a different experimental regime,
from initialization (leftmost panel), to an intermediate iteration (middle panel), to the final iteration
(rightmost panel). As in Figure 1, the vertical axis represents both $y$ values (for the oracle and ground
truth) and probability density values (of the training distribution, $p_0(x)$, and search distributions,
$p_{\theta^{(t)}}(x)$). Shaded envelopes correspond to ±1 standard deviation of the oracles, $\sigma_{\beta^{(t)}}$, with the
oracle expectations, $\mu_{\beta^{(t)}}(x)$, shown as a solid line. Greyscale of training points corresponds to
their importance weights used in autofocusing. In the rightmost panels, for easy visualization of the
final search models achieved with and without AF, the stars and dotted horizontal lines indicate the
ground-truth values corresponding to the points of maximum density.
Figure S2: Training distribution and initial oracle for designing superconductors. Simulated training
data were generated from a training distribution, p0 (x), which was a multivariate Gaussian fit to data
points with ground-truth expectations below the 80th percentile. The left panel shows histograms
of the ground-truth expectations of these original data points, and the ground-truth expectations
of simulated training data. The right panel illustrates the error of an initial oracle used in the
experiments, by plotting the ground-truth and predicted labels of 10,000 test points drawn from the
training distribution. The RMSE here was 7.31.
Autofocusing. During autofocusing, each model in the oracle ensemble was retrained with the
re-weighted training data, using the same optimization hyperparameters as the initial oracle, except
early stopping was based on the re-weighted log-likelihood of the validation set. For the results in
Table 1, to help control the variance of the importance weights, we flattened the importance weights
to $w_i^\alpha$, where $\alpha = 0.2$ [24], and also self-normalized them [26]. We found that autofocusing yielded
similarly widespread benefits for a wide range of values of $\alpha$, including $\alpha = 1$, which corresponds to
a "pure" autofocusing strategy without variance control (Table S2).
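Concretely, this post-processing is a two-liner (a minimal sketch; normalizing to sum one is our convention here, and any constant rescaling is equivalent for weighted training):

import numpy as np

def flatten_and_normalize(w, alpha=0.2):
    # Flatten to w_i**alpha to control variance [24], then self-normalize [26].
    w = np.asarray(w, dtype=float) ** alpha
    return w / w.sum()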
MBO algorithms. Here, we provide a brief description of the different MBO algorithms used in
the superconductivity experiments (Tables 1, S2, S3, S4, Figures S3 and S4). Wherever applicable, in
parentheses we anchor these descriptions in the notation and procedure of Algorithm 1; a short code
sketch of several of these sample weights follows the list.
• Design by Adaptive Sampling (DbAS) [11]. A basic EDA that anneals a sequence of relaxed
constraint sets, $S^{(t)}$, such that $S^{(t)} \supseteq S^{(t+1)} \supseteq S$, to iteratively solve the oracle-based MBD
problem (Equation 2). (At iteration $t$, DbAS uses $V(\tilde{x}_i^{(t)}) = P_{\beta^{(t-1)}}(y \in S^{(t)} \mid \tilde{x}_i^{(t)})$.)
• Conditioning by Adaptive Sampling (CbAS) [11]. Seeks to estimate the training distribu-
tion conditioned on the desired constraint set, S (Equation 20). Similar mechanistically
to DbAS, as it involves constructing a sequence of relaxed constraint sets, but also incor-
porates an implicit trust region based on the training distribution. (At iteration $t$, CbAS
uses $V(\tilde{x}_i^{(t)}) = (p_0(\tilde{x}_i^{(t)})/p_{\theta^{(t-1)}}(\tilde{x}_i^{(t)}))\, P_{\beta^{(t-1)}}(y \in S^{(t)} \mid \tilde{x}_i^{(t)})$. See Algorithm 2; non-
autofocused CbAS excludes Steps 6 and 7.)
• Reward-Weighted Regression (RWR) [23]. An EDA used in the reinforcement
learning community. (At iteration $t$, RWR uses $V(\tilde{x}_i^{(t)}) = v_i' / \sum_{j=1}^m v_j'$, where
$v_i' = \exp(\gamma\, \mathbb{E}_{\beta^{(t-1)}}[y \mid \tilde{x}_i^{(t)}])$ and $\gamma > 0$ is a hyperparameter.)
• “Feedback” Mechanism (FB) [17]. A heuristic version of CbAS, which maintains samples
from previous iterations to prevent the search model from changing too rapidly. (At Step
3 in Algorithm 1, FB uses samples from the current iteration with oracle expectations that
surpass some percentile threshold, along with a subset of promising samples from previous
iterations.)
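As indicated above, the following sketch illustrates the per-sample weights for DbAS, CbAS, and RWR under a Gaussian oracle with mean mu(x) and standard deviation sigma, with threshold constraint sets $S^{(t)} = \{y : y \geq \gamma_t\}$; the density and mean callables are illustrative placeholders:

import numpy as np
from scipy.stats import norm

def v_dbas(x, mu, sigma, gamma_t):
    # DbAS: V(x) = P(y in S(t) | x) under the Gaussian oracle.
    return norm.sf(gamma_t, loc=mu(x), scale=sigma)

def v_cbas(x, mu, sigma, gamma_t, p0, p_search):
    # CbAS: the DbAS weight times the trust-region ratio p0(x) / p_theta(x).
    return (p0(x) / p_search(x)) * v_dbas(x, mu, sigma, gamma_t)

def v_rwr(x, mu, gamma=0.01):
    # RWR: exponentiated oracle expectations, normalized across the samples.
    v = np.exp(gamma * mu(x))
    return v / v.sum()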
Table S2: Designing superconducting materials. Same experiments and caption as Table 1, except
with α = 1 (no flattening of the importance weights to control variance). We ran six different MBO
methods, each with and without autofocusing. For each method, we extracted those samples with
oracle expectations above the 80th percentile and computed their ground-truth expectations. We
report the median and maximum of those ground-truth expectations (both in degrees K), their percent
chance of improvement (PCI, in percent) over the maximum label in the training data, as well as
the Spearman correlation (ρ) and root mean squared error (RMSE, in degrees K) between the oracle
and ground-truth expectations. Each reported score is averaged over 10 trials, where, in each trial,
a different training set was sampled from the training distribution. “Mean Diff.” is the average
difference between the score when using autofocusing compared to not. Bold values with one star
(*) and two stars (**), respectively, mean p-value < 0.05 and < 0.01 from a two-sided Wilcoxon
signed-rank test on the 10 paired score differences. For all scores but RMSE, a higher value means
autofocusing yielded better results (as indicated by the arrow ↑); for RMSE, the opposite is true (as
indicated by the arrow ↓).
              CbAS                                        DbAS
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      51.5     103.8   0.11    0.05    17.2       57.0     98.4    0.11    0.01    29.6
Autofocused   73.2     116.0   2.29    0.56    12.8       69.4     109.9   0.68    0.01    27.4
Mean Diff.    21.8**   12.2**  2.18**  0.51**  -4.4**     12.4**   11.5**  0.58**  0.01    -2.2

              RWR                                         FB
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      43.4     102.0   0.05    0.92    7.4        49.2     100.6   0.14    0.09    17.5
Autofocused   68.5     113.4   1.34    0.63    14.2       63.4     110.8   0.63    0.49    11.2
Mean Diff.    25.1**   11.5**  1.30**  -0.29** 6.8**      14.2**   10.2*   0.50**  0.40**  -6.3**

              CEM-PI                                      CMA-ES
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      34.5     55.8    0.00    -0.16   148.3      42.1     69.4    0.00    0.27    27493.2
Autofocused   59.5     89.1    0.39    0.02    46.9       45.1     70.7    0.00    0.50    27944.9
Mean Diff.    25.0*    33.3*   0.39*   0.18    -101.4**   3.1      1.2     0       0.22*   451.7
CbAS, DbAS, FB, and CEM-PI all have hyperparameters corresponding to a percentile threshold (for
CbAS and DbAS, this is used to construct the relaxed constraint sets). We set this hyperparameter
to 90 for all these methods. For RWR, we set γ = 0.01, and for CMA-ES, we set the step size
hyperparameter to σ = 0.01.
Importance weight variance control. To see how much importance weight variance affects
autofocusing, we conducted the same experiments as Table 1, except without flattening the weights
to reduce variance (Table S2). For CbAS, DbAS, RWR, FB, and CEM-PI, autofocusing without
variance control yielded statistically significant improvements to the majority of scores, though with
somewhat lesser effect sizes than in Table 1 when the weights were flattened with α = 0.2. For
CMA-ES, the only significant improvement autofocusing rendered was to the Spearman correlation
between the oracle and the ground-truth expectations. Note that CMA-ES is a state-of-the-art method
for optimizing a given objective with a multivariate Gaussian search model [47], which likely led
to liberal movement of the search model away from the training distribution and therefore high
importance weight variance.
Oracle capacity. To see how different oracle capacities affect the improvements gained from auto-
focusing, we ran the same experiments as Table 1 with two different oracle architectures. One
architecture had higher capacity than the original oracle (hidden layer sizes of (200, 200, 100, 100, 10)
compared to (100, 100, 100, 100, 10); Table S3), and one had lower capacity (hidden layer sizes of
(100, 100, 10); Table S4).
(a) CbAS.
(b) DbAS.
(c) RWR.
(d) FB.
Figure S3: Designing superconducting materials. Trajectories of different MBO algorithms run
without (left) and with autofocusing (right), on one example trial used to compute Table 1. At each
iteration, we extract the samples with oracle expectations greater than the 80th percentile. For these
samples, solid lines give the median oracle (green) and ground-truth (indigo) expectations. The
shaded regions capture 70 and 95 percent of these quantities. The RMSE at each iteration is between
the oracle and ground-truth expectations of all samples. The horizontal axis is sorted by increasing
80th percentile of oracle expectations (i.e., the samples plotted at iteration 1 are from the iteration
whose 80th percentile of oracle expectations was lowest). This ordering exposes whether the oracle
expectations of samples were correlated with their ground-truth expectations. Two more
algorithms are shown in Figure S4.
(a) CEM-PI.
(b) CMA-ES.
Figure S4: Designing superconducting materials. Continuation of Figure S3.
Table S3: Designing superconducting materials. Same experiments and caption as Table 1, except
using an oracle architecture with hidden layers 200 → 200 → 100 → 100 → 10. We ran six
different MBO methods, each with and without autofocusing. For each method, we extracted
those samples with oracle expectations above the 80th percentile and computed their ground-truth
expectations. We report the median and maximum of those ground-truth expectations (both in degrees
K), their percent chance of improvement (PCI, in percent) over the maximum label in the training
data, as well as the Spearman correlation (ρ) and root mean squared error (RMSE, in degrees K)
between the oracle and ground-truth expectations. Each reported score is averaged over 10 trials,
where, in each trial, a different training set was sampled from the training distribution. “Mean Diff.”
is the average difference between the score when using autofocusing compared to not. Bold values
with one star (*) and two stars (**), respectively, mean p-value < 0.05 and < 0.01 from a two-sided
Wilcoxon signed-rank test on the 10 paired score differences. For all scores but RMSE, a higher
value means autofocusing yielded better results (as indicated by the arrow ↑); for RMSE, the opposite
is true (as indicated by the arrow ↓).
              CbAS                                        DbAS
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      48.3     100.8   0.05    0.03    19.6       55.3     98.6    0.025   -0.02   32.1
Autofocused   79.0     119.4   4.35    0.55    13.5       81.6     113.3   5.33    0.01    27.0
Mean Diff.    30.7**   18.6**  4.30**  0.52**  -6.1**     26.4**   14.8**  5.30**  0.03    -5.1

              RWR                                         FB
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      36.5     81.3    0.00    -0.24   55.5       47.8     101.5   0.09    0.06    18.3
Autofocused   73.4     114.8   2.05    0.72    12.7       63.5     113.1   0.58    0.58    10.7
Mean Diff.    36.9**   33.4**  2.05**  0.97**  -42.8**    15.7**   11.7**  0.49**  0.51**  -7.5**

              CEM-PI                                      CMA-ES
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      48.2     58.3    0.00    0.09    271.4      39.0     63.1    0.00    0.26    6774.6
Autofocused   64.5     84.1    0.48    -0.14   61.07      53.1     79.0    0.01    0.48    10183.7
Mean Diff.    16.3     25.9*   0.48    -0.22   -210.3     14.1*    15.9*   0.01    0.23    3409.1
Table S4: Designing superconducting materials. Same experiments and caption as Table 1, except
using an oracle architecture with hidden layers 100 → 100 → 10. We ran six different MBO
methods, each with and without autofocusing. For each method, we extracted those samples with
oracle expectations above the 80th percentile and computed their ground-truth expectations. We
report the median and maximum of those ground-truth expectations (both in degrees K), their percent
chance of improvement (PCI, in percent) over the maximum label in the training data, as well as
the Spearman correlation (ρ) and root mean squared error (RMSE, in degrees K) between the oracle
and ground-truth expectations. Each reported score is averaged over 10 trials, where, in each trial,
a different training set was sampled from the training distribution. “Mean Diff.” is the average
difference between the score when using autofocusing compared to not. Bold values with one star
(*) and two stars (**), respectively, mean p-value < 0.05 and < 0.01 from a two-sided Wilcoxon
signed-rank test on the 10 paired score differences. For all scores but RMSE, a higher value means
autofocusing yielded better results (as indicated by the arrow ↑); for RMSE, the opposite is true (as
indicated by the arrow ↓).
              CbAS                                        DbAS
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      46.8     98.5    0.06    -0.03   23.8       56.3     97.7    0.02    0.00    37.0
Autofocused   67.0     114.3   1.4     0.52    13.0       72.5     108.4   1.3     0.04    27.6
Mean Diff.    20.2**   15.8**  1.3**   0.55**  -10.9**    16.2**   10.7**  1.3**   0.03    -9.4**

              RWR                                         FB
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      30.9     76.8    0.00    -0.33   83.5       47.2     100.4   0.04    0.02    19.9
Autofocused   66.0     112.6   0.68    0.57    18.3       58.2     111.4   0.43    0.50    12.3
Mean Diff.    35.1**   35.8**  0.68**  0.90**  -65.2**    11.0**   11.0**  0.40**  0.48**  -7.6**

              CEM-PI                                      CMA-ES
              Median↑  Max↑    PCI↑    ρ↑      RMSE↓      Median↑  Max↑    PCI↑    ρ↑      RMSE↓
Original      36.3     46.2    0.00    -0.01   382.4      36.9     62.3    0.00    0.10    9587.1
Autofocused   53.9     71.3    0.04    -0.04   210.8      43.9     80.0    0.00    0.29    40858.6
Mean Diff.    17.6     25.1*   0.04    -0.03   -171.6     7.0*     17.7*   0       0.19**  31271.5**