Adaptive Sampling Bi-Fidelity Stochastic Trust Region Method For Derivative-Free Stochastic Optimization
Abstract
Bi-fidelity stochastic optimization is increasingly favored for streamlining optimization pro-
cesses by employing a cost-effective low-fidelity (LF) function, with the goal of optimizing a
more expensive high-fidelity (HF) function. In this paper, we introduce ASTRO-BFDF, a
new adaptive sampling trust region method specifically designed for solving unconstrained
bi-fidelity stochastic derivative-free optimization problems. Within ASTRO-BFDF, the LF
function serves two purposes: first, to identify better iterates for the HF function when a high
correlation between them is indicated by the optimization process, and second, to reduce the
variance of the HF function estimates by Bi-fidelity Monte Carlo (BFMC). In particular, the
sample sizes are dynamically determined with the option of employing either crude Monte
Carlo or BFMC, while balancing optimization error and sampling error. We demonstrate that
the iterates generated by ASTRO-BFDF converge to the first-order stationary point almost
surely. Additionally, we numerically demonstrate the superiority of our proposed algorithm
by testing it on synthetic problems and simulation optimization problems with discrete event
simulations.
1 Introduction
We consider the simulation optimization problem
$$\min_{x \in \Re^d} f^h(x) := \mathbb{E}[F^h(x, \xi)] = \int_{\Xi} F^h(x, \xi)\, P(d\xi), \qquad (1)$$
1.1 Bi-fidelity Derivative-free Stochastic Optimization
The iterative algorithms designed to solve problem (1) typically produce a random sequence
{Xk , k ≥ 1}. In the context of SO, these algorithms generate the sequence by determining
both the direction and the step size. Given that direct gradient information is not available from
the simulation oracle, we rely on approximation techniques like a finite difference method [1],
interpolation/regression models [2, 3, 4], and Gaussian smoothing [5] to determine the direction.
These approximation methods are based on function estimates, which are obtained by repeatedly
invoking the stochastic simulation oracle, as shown below:
$$\bar{F}^h(x, n) = \frac{1}{n}\sum_{i=1}^{n} F^h(x, \xi_i), \qquad (2)$$
with a variance estimate $(\hat{\sigma}^h)^2(x, n) := n^{-1}\sum_{i=1}^{n}\big(F^h(x, \xi_i) - \bar{F}^h(x, n)\big)^2$. To obtain convergence results in some probabilistic sense, the algorithms must have sufficiently large sample sizes
for each design point during the optimization process. Therefore, it is a logical step to aim for
reducing the total number of simulation replications during the optimization process while still
achieving convergence, as this is typically the main source of computational load. In line with
these efforts, one strategy involves leveraging a LF simulation oracle F l (·, ξ), which is less costly
than the original high-fidelity (HF) simulation oracle F h (·, ξ), whenever possible throughout the
optimization process. This particular method of optimization falls under the category of bi-fidelity
stochastic optimization [6, 7, 8, 9].
Figure 1: An illustration of bi-fidelity functions. The black and red curves represent the true
objective functions of the HF and LF versions, respectively. Meanwhile, the blue and orange
curves illustrate a single sample path of the stochastic objective functions for the HF and LF
versions, respectively.
a near-optimal solution of the HF function, even though the LF function is solely utilized, as
illustrated in Figure 1. However, the specifics of how and when to utilize the LF simulation oracle
have remained elusive, prompting us to pose two overarching questions.
Q1. When is it appropriate to utilize the LF simulation oracle, and when should it not be used
during optimization?
Q2. How many samples from the HF and LF simulation oracles are necessary to attain sufficiently accurate function estimates for optimization?
In this paper, we propose a sample-efficient solver for bi-fidelity stochastic optimization, aiming
to address questions Q1 and Q2 specifically. We begin by introducing relevant existing sampling
methods that have been used for sample-efficient uncertainty quantification, regardless of their
purpose for optimization.
available, the HF function estimate is obtained using both LF and HF simulation oracles, in the hope of reducing the variance of the HF function estimate [13]. A bi-fidelity Monte Carlo (BFMC) estimator is then $\bar{F}^{bf}(x, n, v, c) = \bar{F}^h(x, n) - c(\bar{F}^l(x, n) - \bar{F}^l(x, v))$ for any $c \in \Re$, which is an unbiased estimator of the expectation of the HF function. Here, the independence or dependence between $\bar{F}^l(x, v)$ and $\bar{F}^h(x, n)$, as well as $\bar{F}^l(x, n)$, hinges on how the random variable $\xi$ is managed. For instance, consider (3), which employs $v^{-1}\sum_{i=1}^{v} F^l(x, \xi_i)$ for $\bar{F}^l(x, v)$:
$$\bar{F}^{bf}(x, n, v, c) = \frac{1}{n}\sum_{i=1}^{n} F^h(x, \xi_i) - c\left(\frac{1}{n}\sum_{i=1}^{n} F^l(x, \xi_i) - \frac{1}{v}\sum_{i=1}^{v} F^l(x, \xi_i)\right). \qquad (3)$$
The variance of this estimator is
$$\begin{aligned} &c^2\big(\mathrm{Var}(\bar{F}^l(x, n)) + \mathrm{Var}(\bar{F}^l(x, v))\big) - 2c\,\mathrm{Cov}(\bar{F}^h(x, n), \bar{F}^l(x, n)) \\ &\quad + 2c\,\mathrm{Cov}(\bar{F}^h(x, n), \bar{F}^l(x, v)) - 2c^2\,\mathrm{Cov}(\bar{F}^l(x, n), \bar{F}^l(x, v)) + \mathrm{Var}(\bar{F}^h(x, n)). \end{aligned} \qquad (4)$$
Therefore, variance reduction becomes feasible with appropriate covariances and variances for
certain values of n, v, and c.
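To make the mechanics concrete, the following is a minimal numerical sketch of the estimator in (3); the toy HF/LF oracles, the shared noise stream inducing the correlation, and the specific values of $n$, $v$, and $c$ are our illustrative assumptions, not part of the algorithm.

```python
# A minimal sketch of BFMC (3) versus crude Monte Carlo at one design point.
# The toy oracles and all constants below are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)
x = 0.7                        # a fixed design point
n, v, c = 50, 400, 0.8         # HF/LF sample sizes (v > n) and coefficient

def oracles(m):
    """Draw m paired HF/LF replications at x; the common xi induces correlation."""
    xi = rng.normal(size=m)
    f_h = np.sin(x) + xi                              # HF output, mean sin(x)
    f_l = np.sin(x) + xi + 0.3 * rng.normal(size=m)   # noisier LF output
    return f_h, f_l

def bfmc():
    """Estimator (3): the first n of the v LF draws are paired with the HF draws."""
    f_h, f_l = oracles(v)
    return f_h[:n].mean() - c * (f_l[:n].mean() - f_l.mean())

bf = [bfmc() for _ in range(2000)]
cmc = [oracles(n)[0].mean() for _ in range(2000)]     # CMC with same HF budget
print("Var BFMC:", np.var(bf), "Var CMC:", np.var(cmc))  # BFMC is smaller here
```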
In our proposed algorithm, we have developed an innovative AS strategy, referred to as bi-fidelity
Adaptive Sampling (BAS), that leverages both LF and HF oracles. Our approach dynamically
employs BFMC and CMC, guided by estimates of covariances and variances for the functions.
While we expect that BAS can be deployed within a broad range of iterative solvers, we focus exclusively on stochastic Trust Region (TR) algorithms to solve (1), which iterate over the following four steps:
(a) (model construction) a local model is constructed to approximate the objective function f
by utilizing specific design points and their function estimates within a designated area of
confidence, i.e., TR, typically defined as an L2 region with a radius of ∆k centered around
the current iterate Xk;
(b) (candidate generation) the local model is approximately minimized within the TR to obtain a candidate solution Xks;
(c) (candidate evaluation) the objective function at Xks is estimated by querying the oracle, and depending on this evaluation, Xks is either accepted or rejected; and
(d) (TR management) if Xks is accepted, it becomes the subsequent iterate Xk+1 , and the TR
radius ∆k is either enlarged or remains unchanged; conversely, if Xks is rejected, Xk retains
its position as the next iterate Xk+1 , and ∆k decreases to facilitate the construction of a
more accurate local model to approximate the objective function f .
As described in Section 3.2, our proposed algorithm performs the aforementioned four steps
multiple times within a single iteration to address Q1, utilizing both HF and LF simulation oracles.
Specifically, when the correlation between the LF and HF function is expected to be high, which is
determined by the optimization history, the local models will be constructed for the LF function.
If we fail to find a better solution using the local model for the LF function, the local model will
be constructed for the HF function within the same iteration.
(a) We propose a novel stochastic TR method with adaptive sampling tailored specifically for
stochastic bi-fidelity optimization problems, aptly named ASTRO-BFDF. Addressing Q1,
ASTRO-BFDF integrates two separate TRs to handle HF and LF functions, along with a
novel concept termed an adaptive correlation constant. This constant dynamically assesses
the need for constructing a local model for the LF function, with its value evolving in
response to the historical data gathered during the optimization process.
(b) To provide an answer for Q2, we suggest a new adaptive sampling algorithm utilizing both HF
and LF simulation oracles, named BAS. The following three critical decisions are dynamically
made in BAS while replicating function evaluations.
(c) We prove the almost sure convergence, i.e., limk→∞ ∥∇f h (Xk )∥ = 0 w.p.1, of ASTRO-
BFDF. The analysis revolves around two key points. Firstly, when the candidate solution
from the local model for the LF function is accepted, it must ensure a sufficient reduction
in the HF function. Secondly, the estimates for stochastic errors obtained with BFMC
should be sufficiently smaller than the optimality error. Together, these aspects enable the
algorithm to find a better solution for the objective function with reduced computational
effort.
(d) The performance of ASTRO-BFDF has been evaluated using test problems from the SimOpt
library [14]. We started with synthetic problems, created by adding artificial stochastic
noise to deterministic functions. For more realistic numerical experiments, we also tested
simulation optimization problems involving discrete event simulation. Our findings not only
highlight the superior performance of ASTRO-BFDF but also explore various scenarios in
which using the LF function in optimization is beneficial or not.
2 Preliminaries
In this section, we provide key definitions, standing assumptions, and some useful results that will
be invoked in the convergence analysis of the proposed algorithm.
2.1 Notation
We use bold font for vectors; x = (x1 , x2 , · · · , xd ) ∈ ℜd denotes a d-dimensional vector. Calli-
graphic fonts represent sets, and sans serif fonts denote matrices. Our default norm ∥ · ∥ is the L2
norm. The closed ball of radius ∆ > 0 centered at x0 is B(x0 ; ∆) = {x ∈ ℜd : ∥x − x0 ∥ ≤ ∆}.
For a sequence of sets An , An i.o. denotes lim supn→∞ An . We write f (x) = O(g(x)) if there are
positive constants ε and m such that |f (x)| ≤ mg(x) for all x with 0 < ∥x∥ < ε. Capital letters
denote random scalars and vectors. For a sequence of random vectors $X_k$, $k \in \mathbb{N}$, $X_k \xrightarrow{\text{w.p.1}} X$ denotes almost sure convergence. “iid” means independent and identically distributed, and “w.p.1”
means with probability 1. The superscripts h and l indicate that the terminology is related to
high-fidelity and low-fidelity simulations, respectively. The terms σ̂ h (x, n) and σ̂ l (x, n) are the
standard deviation estimates of HF and LF functions at x with sample size n, while σ̂ h,l (x, n) is
the covariance estimate between them.
where

The matrix $\mathsf{M}(\Phi, \mathcal{X}_k)$ is nonsingular if the set $\mathcal{X}_k$ is poised in $B(X_k; \Delta_k^q)$. A set $\mathcal{X}_k$ is $\Lambda$-poised in $B(X_k; \Delta_k^q)$ if $\Lambda \geq \max_{i=0,\dots,p}\max_{z \in B(X_k; \Delta_k^q)} |l_i(z)|$, where $l_i(z)$ are the Lagrange polynomials. If there exists a solution to (5), then the function $M_k^q : B(X_k; \Delta_k^q) \to \Re$, defined as $M_k^q(x) = \sum_{j=0}^{p} \nu_{k,j}^q \phi_j(x)$, is a stochastic polynomial interpolation of estimated values of $f^q$ on $B(X_k; \Delta_k^q)$.
In particular, if $G_k^q := \big[\nu_{k,1}^q\ \nu_{k,2}^q\ \cdots\ \nu_{k,d}^q\big]^{\intercal}$ and $\mathsf{H}_k^q$ is a symmetric $d \times d$ matrix with elements uniquely defined by $(\nu_{k,d+1}^q, \nu_{k,d+2}^q, \dots, \nu_{k,p}^q)$, then we can define the stochastic quadratic model $M_k^q : B(X_k; \Delta_k^q) \to \Re$ as
$$M_k^q(x) = \nu_{k,0}^q + (x - X_k)^{\intercal} G_k^q + \frac{1}{2}(x - X_k)^{\intercal}\mathsf{H}_k^q(x - X_k). \qquad (6)$$
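As an illustration of (6), the sketch below fits such a model by interpolation; the diagonal-Hessian basis and the helper names are our assumptions (the diagonal-Hessian variant is the one mentioned in Appendix C), not the paper's implementation.

```python
# A sketch: fit nu in (6) with a diagonal Hessian from 2d+1 function
# estimates by solving the interpolation conditions M(Phi, X) nu = Fbar.
import numpy as np

def fit_quadratic_diag(points, fbar, center):
    """points: (2d+1, d) design points; fbar: function estimates at them."""
    S = points - center                                     # shift to TR center
    Phi = np.hstack([np.ones((len(S), 1)), S, 0.5 * S**2])  # basis [1, s, s^2/2]
    nu = np.linalg.solve(Phi, fbar)                         # interpolation system
    d = S.shape[1]
    return nu[0], nu[1:d + 1], np.diag(nu[d + 1:])          # nu_0, G, H (diagonal)
```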
Definition 2 (stochastic fully linear models). Given x ∈ ℜd , ∆q > 0 and q ∈ {h, l}, a function
M q : B(x; ∆q ) → ℜ is a stochastic fully linear model of f q on B(X; ∆q ) if ∇f q is Lipschitz
continuous with constant κL , and there exist positive constants κeg and κef dependent on κL but
independent of x and ∆q such that almost surely
∥∇f q (x) − ∇M q (x)∥ ≤ κeg ∆q and |f q (x) − M q (x)| ≤ κef (∆q )2 ∀x ∈ B(x; ∆q ).
Definition 3 (Cauchy reduction). Given Xk ∈ ℜd , ∆qk > 0, q ∈ {h, l}, and a function Mkq :
B(Xk ; ∆qk ) → ℜ obtained following Definition 1, Skc is called the Cauchy step if
$$M^q(X_k) - M^q(X_k + S_k^c) \geq \frac{1}{2}\|\nabla M^q(X_k)\|\min\left\{\frac{\|\nabla M^q(X_k)\|}{\|\nabla^2 M^q(X_k)\|}, \Delta_k^q\right\}.$$
When ∥∇2 Mkq (Xk )∥ = 0, we assume ∥∇M q (Xk )∥/∥∇2 M q (Xk )∥ = +∞. The Cauchy step is
derived by minimizing the model Mkq (·) along the steepest descent direction within B(Xk ; ∆qk ),
making it easy and quick to compute.
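For a quadratic model, this one-dimensional minimization has a closed form; the sketch below (our helper names, the standard TR textbook formula) computes it.

```python
# A sketch of the Cauchy step for a quadratic model m(s) = f0 + g's + 0.5 s'Hs:
# minimize along the steepest descent direction -g within radius delta.
import numpy as np

def cauchy_step(g, H, delta):
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    t_max = delta / gnorm               # step length hitting the TR boundary
    gHg = g @ H @ g
    if gHg <= 0.0:                      # nonpositive curvature: go to boundary
        t = t_max
    else:
        t = min(gnorm**2 / gHg, t_max)  # unconstrained 1-D minimizer, capped
    return -t * g
```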
Definition 4 (filtration and stopping time). A filtration {Fk }k≥1 on a probability space (Ω, P, F)
is a sequence of σ-algebras, each contained within the next, such that for all k, Fk is a subset of
Fk+1 , and all are subsets of F. A function N : Ω → {0, 1, 2, . . . , ∞} is referred to as a stopping
time with respect to the filtration $\{\mathcal{F}_n\}$ if the set $\{\omega \in \Omega : N(\omega) = n\}$ is an element of $\mathcal{F}_n$ for every $n < \infty$.
Assumption 1 (function). The HF function f h and the LF function f l are twice continuously
differentiable in an open domain Ω, ∇f h and ∇f l are Lipschitz continuous in Ω with constant
κLg > 0, and ∇2 f h and ∇2 f l are Lipschitz continuous in Ω with constant κL > 0.
We make the next assumption on the higher moments of the stochastic noise, resembling the Bernstein condition. Random variables fulfilling Assumption 2 exhibit sub-exponential tail behavior.
Assumption 2 (stochastic noise). The Monte Carlo oracles generate iid random variables $F^q(X_k^i, \xi_j) = f^q(X_k^i) + E_{k,j}^{i,q}$ with $E_{k,j}^{i,q} \in \mathcal{F}_{k,j}$ for $i \in \{0, 1, 2, \dots, p, s\}$ and $q \in \{h, l\}$, where $X_k^s$ is the candidate iterate at iteration $k$ and $\mathcal{F}_k := \mathcal{F}_{k,0} \subset \mathcal{F}_{k,1} \subset \cdots \subset \mathcal{F}_{k+1}$ for all $k$. Then the stochastic errors $E_{k,j}^{i,q}$ are independent of $\mathcal{F}_{k-1}$, $\mathbb{E}[E_{k,j}^{i,q} \mid \mathcal{F}_{k,j-1}] = 0$, and there exist $(\sigma^q)^2 > 0$ and $b^q > 0$ such that for a fixed $n$,
$$\frac{1}{n}\sum_{j=1}^{n}\mathbb{E}\big[|E_{k,j}^{i,q}|^m \mid \mathcal{F}_{k,j-1}\big] \leq \frac{m!}{2}(b^q)^{m-2}(\sigma^q)^2, \quad \forall m = 2, 3, \cdots, \ \forall k.$$
Theorem 1 (Stochastic noise [15]). Let $c_f > 0$ and $\Delta_k^q > 0$ be given and let $E_{k,j}^{i,q}$ denote the stochastic noise following Assumption 2 for $q \in \{h, l\}$. If the sample size $N(X_k^i)$ is determined by (7), for any $\sigma_0 > 0$ and $k \in \mathbb{N}$, where $\lambda_k = O(\log k)$ and $SV(X_k^i, n)$ is a finite random variable that depends on the sample size $n$ and the design point $X_k^i$, we obtain
$$\sum_{k=1}^{\infty} P\left\{\left|\frac{1}{N(X_k^i)}\sum_{j=1}^{N(X_k^i)} E_{k,j}^{i,q}\right| \geq c_f(\Delta_k^q)^2\right\} < \infty.$$
Despite the fact that Theorem 1 has been proven with SV (Xki , n) = σ̂ h (Xki , n) in Section 3.2
of [15], it can also be trivially established with a finite random variable SV (Xki , n) by employing
the same logical framework. The next result provides an upper bound for the gradient error norm
at any design point within the TR when a stochastic linear or quadratic interpolation model is
used. Combined with Theorem 1, it indicates that the gradient error norm will be bounded by
the order of the TR radius after sufficiently many iterations.
Lemma 1 (Stochastic Interpolation Model [11]). If Mkq (z) is a stochastic linear interpolation
model or a stochastic quadratic interpolation model of f q with the design set Xk := {Xki }pi=0 ⊂
B(Xk ; ∆qk ) and corresponding function estimates F̄ q (Xki , N (Xki )) = f q (Xki ) + Ēki,q (Nki ) for q ∈
{h, l}, there exist positive constants κeg1 and κeg2 such that for any z ∈ B(Xk ; ∆qk ),
$$\|\nabla M^q(z) - \nabla f^q(z)\| \leq \kappa_{eg1}\Delta^q + \kappa_{eg2}\frac{\sqrt{\sum_{i=1}^{p}\big(\bar{E}_k^{i,q}(N_k^i) - \bar{E}_k^{0,q}(N_k^0)\big)^2}}{\Delta^q}, \qquad (8)$$
where $\bar{E}_k^{i,q}(N_k^i) = N(X_k^i)^{-1}\sum_{j=1}^{N(X_k^i)} E_{k,j}^{i,q}$.
Lastly, we present the variance of the BFMC estimator. To make sure that the variance of BFMC is reduced, the second and third terms on the RHS of (9) should be less than zero for some $n$, $v$, and $c$. Another important point is that $\sigma^h(x)$, $\sigma^l(x)$, and $\sigma^{h,l}(x)$ are usually unknown in reality, forcing us to use estimates such as $\hat\sigma^h(x, n)$, $\hat\sigma^l(x, \max\{n, v\})$, and $\hat\sigma^{h,l}(x, \min\{n, v\})$.
Lemma 2 (Variance of BFMC [13]). Let $x \in \Re^d$, $n, v \in \mathbb{N}$, and $c \in \mathbb{R}$. Then the variance of the BFMC estimator $\bar{F}^{bf}(x, n, v, c)$ is
$$\mathrm{Var}(\bar{F}^{bf}(x, n, v, c)) = \frac{(\sigma^h(x))^2}{n} + c^2\left(\frac{1}{n} + \frac{1}{v} - \frac{2}{\max\{n, v\}}\right)(\sigma^l(x))^2 + 2c\left(\frac{1}{\max\{n, v\}} - \frac{1}{n}\right)\sigma^{h,l}(x). \qquad (9)$$
Remark 1. When $v \leq n$, $\mathrm{Var}(\bar{F}^{bf}(x, n, v, c)) \geq \mathrm{Var}(\bar{F}^h(x, n))$. Hence, a variance reduction is only available when $v > n$.
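The formula in (9) is cheap to evaluate, which the algorithm exploits when trading off $n$, $v$, and $c$; the sketch below (our helper names) evaluates it along with the coefficient minimizing it when $v > n$.

```python
# A sketch evaluating the BFMC variance formula (9) and, for v > n, the
# coefficient c* minimizing it; inputs are (sigma^h)^2, (sigma^l)^2, sigma^{h,l}.
def bfmc_variance(var_h, var_l, cov_hl, n, v, c):
    m = max(n, v)
    return (var_h / n
            + c * c * (1.0 / n + 1.0 / v - 2.0 / m) * var_l
            + 2.0 * c * (1.0 / m - 1.0 / n) * cov_hl)

def optimal_c(var_l, cov_hl):
    """(9) is quadratic in c; for v > n its minimizer is sigma^{h,l}/(sigma^l)^2."""
    return cov_hl / var_l
```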
(1) The sample sizes are intricately managed through adaptive sampling using BFMC or CMC.
Within this approach, two critical decisions are made. Initially, as the samples stream in, the
method discerns between employing CMC or BFMC. Secondly, it dynamically adjusts the
sample sizes for both HF and LF simulation oracles, alongside determining the coefficient c
in (3), all in real-time as the sample sizes increase.
(2) At each iteration k, two local models can be constructed using HF and LF simulation oracles,
respectively, each with its own TR: ∆lk for the LF function and ∆hk for the HF function.
The local model utilizing the LF oracle serves two purposes: 1) identifying the candidate
solution for the next iterate, and 2) updating the adaptive correlation constant determining
the utilization of the local model for the LF function in Algorithm 3.
We first introduce the bi-fidelity adaptive sampling (BAS) strategy, which corresponds to the
first feature.
$$N^p(x) = \min\left\{n \in \mathbb{N} : \frac{\hat\sigma^h(x, n)}{\sqrt{n}} \leq \frac{\kappa\Delta_k^2}{\sqrt{\lambda_k}}\right\} \qquad (10)$$
4: loop
5: Approximately compute C ∗ , N ∗ , and V ∗ by solving the problem (11) and set c = C ∗ .
6: if wh N ∗ + wl V ∗ ≤ wh N p then
7: Set v = max{n + 1, v} and update σ̂ h,l (x, n) and σ̂ l (x, v) by calling the LF oracle.
8: if $\mathrm{Var}(\bar{F}^{bf}(x, n, v, c)) \leq \kappa^2\Delta_k^4\lambda_k^{-1}$ then
9: return [n, v, c, F̄ bf (x, n, v, c)] (BFMC)
10: end if
11: if n ≥ N ∗ − 1 then
12: Set $v = v + s^l$, get $s^l$ additional replications of the LF oracle, and update $\hat\sigma^l(x, v)$.
13: else
14: Set n = n + sh and update σ̂ h (x, n) and σ̂ h,l (x, n) by calling the LF and HF oracles.
15: end if
16: else
17: if n ≥ N p (x) then
18: return [n, v, c, F̄ h (x, n)] (CMC)
19: end if
20: Set n = n + sh and update σ̂ h (x, n) and N p (x) by calling the HF oracle.
21: end if
22: end loop
3.1 Adaptive Sampling for Bi-Fidelity Stochastic Optimization
While BFMC has the capability to reduce the variance of the function estimate, blindly employing
BFMC may not always be advantageous. For example, when the expense of invoking the LF
simulation oracle is comparable or marginally lower than that of the HF one, and the inherent
variance of the LF simulation significantly exceeds that of the HF simulation, BFMC should
be avoided. Therefore, it is essential to decide which Monte Carlo method to employ at a given
design point based on the variance of the LF and HF simulation output and the covariance between
them. However, a challenge is that the true variances of the LF and HF simulation output are
unknown. Therefore, when adaptive sampling is utilized, the choice of the MC method needs to
be dynamically determined based on variance and covariance estimates, which are sequentially
updated using the simulation results. In summary, we should dynamically determine N, V , and C
while streaming the simulation replications, where N and V are the sample sizes for the HF and
LF oracle and C represents the coefficient in the BFMC estimate, which is denoted as c in (3).
To achieve this for any design point x at iteration k, we suggest BAS, as listed in Algorithm 1.
Algorithm 1 starts by drawing n replications from the HF oracle and the LF oracle to estimate the variance and covariance terms in (9). By leveraging the variance estimate $\hat\sigma^h(x, n)$, we can derive a predicted minimum sample size $N^p(x)$ for CMC, adhering to the adaptive sampling rule (7). Then the predicted computational cost of CMC at $x$ is represented as $w^h N^p(x)$, where
wh is the cost of calling HF oracle once. Our next step involves juxtaposing this against the
projected computational costs of BFMC. To predict the lowest costs with corresponding sample
sizes for BFMC, we solve (11) with variance estimates $\hat\sigma^h(x, n)$, $\hat\sigma^l(x, \max\{v, n\})$, and $\hat\sigma^{h,l}(x, n)$ for $\mathrm{Var}(\bar{F}^{bf}(x, \tilde{n}, \tilde{v}, \tilde{c}))$:
$$[N^*, V^*, C^*] \in \operatorname*{argmin}_{\tilde{n}, \tilde{v} \in \mathbb{N},\ \tilde{c} \in \Re}\ w^h\tilde{n} + w^l\tilde{v} \qquad (11)$$
where wl is the cost for calling LF oracle once. The first constraint is the adaptive sampling
rule which originated from (7), ensuring that BFMC achieves the required accuracy. We are now
poised to contrast the predicted computational costs between the crude MC and BFMC.
If wh N p ≤ wh N ∗ +wl V ∗ , employing the CMC method would be more cost-effective in achieving
the required accuracy of estimates at the current iteration, given the information available. Hence,
if n ≥ N p , the accuracy of the function estimate is already sufficient to proceed the optimization
and thus, the algorithm returns the function estimate with CMC (F̄ (x, n)). If not, we update
σ̂ h (x, n = n + sh ) by calling sh additional replications of the HF oracle. Then we proceed to Step
5 and continue with the algorithm using the newly updated n and variance estimates. It is worth
to emphasize that the variance estimate for the LF function and the covariance estimate are not
updated.
Following Step 5, if the cost of BFMC is lower than that of CMC, BFMC might offer greater
cost-effectiveness, prompting the algorithm to proceed to Step 6. We first set v = max{v, n + 1}
with the updated $n$ in Step 20 to ensure that $v > n$ (see Remark 1). Following this adjustment, additional replications of the LF oracle may become necessary due to the updated value of $v$. Consequently, it is imperative to update both $\hat\sigma^l(x, v)$ and $\hat\sigma^{h,l}(x, n)$ to reflect this change. Subsequently, if the variance of the function estimate by BFMC is small enough compared to the
optimality measure, i.e.,
$$\mathrm{Var}(\bar{F}^{bf}(x, n, v, c)) \leq \kappa^2\Delta_k^4\lambda_k^{-1}, \qquad (12)$$
the algorithm returns F̄ bf (x, n, v, c). If not, we need to decide whether to increase n or v. As
evident from (9), n has a more pronounced impact on the left-hand side of (12) compared to v.
Therefore, our initial assessment focuses on determining whether we need to increase n or not. If
n < N ∗ − 1, we set n = n + sh and update σ̂ h (x, n) and σ̂ h,l (x, n) by obtaining the necessary
number of additional LF and HF simulation results. Subsequently, we move on to Step 5 and
proceed with the algorithm utilizing the newly updated n. If n ≥ N ∗ − 1, we acquire only sl
additional replications from the LF oracle and update σ̂ l (x, v). Following this, we advance to
Step 5 and continue the algorithm with the newly updated v.
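The decision logic just described can be condensed as in the sketch below; the oracle interface, the coarse search used to approximate (11), and all constants are our assumptions, and only the control flow mirrors Steps 4-22 of Algorithm 1.

```python
# A condensed sketch of the BAS loop (Algorithm 1, Steps 4-22), not the
# paper's implementation. Oracles return iterables of replications at x.
import numpy as np

def var_bfmc(vh, vl, chl, n, v, c):
    """BFMC variance, formula (9) specialized to v >= n."""
    return vh / n + c * c * (1 / n - 1 / v) * vl - 2 * c * (1 / n - 1 / v) * chl

def approx_11(vh, vl, chl, w_h, w_l, tau, n_max=10000):
    """Coarsely solve (11): min w_h*n + w_l*v s.t. var_bfmc <= tau and v > n."""
    c = chl / vl                            # minimizer of (9) over c
    r2 = chl * chl / vl
    best = None
    for n in range(2, n_max):
        floor = (vh - r2) / n               # variance as v -> infinity
        if floor >= tau:
            continue                        # this n can never meet (12)
        v = max(n + 1, int(np.ceil(r2 / (tau - floor))) if r2 > 0 else n + 1)
        cost = w_h * n + w_l * v
        if best is None or cost < best[0]:
            best = (cost, n, v, c)
    return best

def bas(x, hf, lf, w_h, w_l, kappa, delta, lam, s_h=5, s_l=5):
    H, L = list(hf(x, 10)), list(lf(x, 10))   # paired initial replications
    tau = kappa**2 * delta**4 / lam           # accuracy target from (12)
    while True:
        n, v = len(H), len(L)
        m = min(n, v)
        vh, vl = np.var(H, ddof=1), np.var(L, ddof=1)
        chl = np.cov(H[:m], L[:m])[0, 1]
        n_cmc = int(np.ceil(vh / tau))        # predicted CMC size, rule (10)
        sol = approx_11(vh, vl, chl, w_h, w_l, tau)
        if sol is not None and sol[0] <= w_h * n_cmc:       # try BFMC
            _, n_star, v_star, c = sol
            if v <= n:                        # enforce v > n (Remark 1)
                L += list(lf(x, n + 1 - v))
                v = n + 1
            if var_bfmc(vh, np.var(L, ddof=1), chl, n, v, c) <= tau:
                est = np.mean(H) - c * (np.mean(L[:n]) - np.mean(L))
                return n, v, c, est           # BFMC estimate, as in (3)
            if n >= n_star - 1:
                L += list(lf(x, s_l))         # grow the LF sample only
            else:
                H += list(hf(x, s_h))         # grow both, keeping pairs
                L += list(lf(x, s_h))
        else:                                  # CMC branch
            if n >= n_cmc:
                return n, v, None, np.mean(H)
            H += list(hf(x, s_h))             # grow the HF sample only
```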
Since $n$, $v$, and $c$ are determined dynamically based on the realization of simulation results in Algorithm 1, these three outputs are stopping times with respect to the filtration. Hence, we refer to the output of Algorithm 1 as $[N_k(x), V_k(x), C_k(x), \tilde{F}_k(x)]$.
Remark 2. While employing CMC, the combined replications from the LF oracle can be reused
for constructing the local model using the LF oracle, detailed in Section 3.2.
Algorithm 2 ASTRO-BFDF
Input: Initial incumbent x0 ∈ ℜd , initial and maximum TR radius ∆l0 , ∆h0 , ∆max > 0, model
fitness thresholds 0 < η < 1 and certification threshold µ > 0, expansion and shrinkage
constants γ1 > 1 and γ2 ∈ (0, 1), sample size lower bound sequence {λk } = {O(log k)},
adaptive sampling constant κ > 0, correlation constant αk > 0, and lower bound of an initial
variance approximation σ0l > 0.
1: for k = 0, 1, 2, . . . do
2: Obtain Ikh , Xks , ∆lk , and ∆hk by calling Algorithm 3.
3: if Ikh is True then
4: Select $\mathcal{X}_k = \{X_k^i\}_{i=0}^{2d} \subset B(X_k; \Delta_k^h)$.
5: Estimate the HF function at $\{X_k^i\}_{i=0}^{2d}$ by calling Algorithm 1.
6: Estimate the LF function $\bar{F}^l(X_k^i, T_k^i)$ at $\{X_k^i\}_{i=0}^{2d}$, satisfying
$$T_k^i = \min\left\{t \in \mathbb{N} : \frac{\max\{\sigma_0^l, \hat\sigma^l(X_k^i, t)\}}{\sqrt{t}} \leq \frac{\kappa(\Delta_k^h)^2}{\sqrt{\lambda_k}}\right\}. \qquad (13)$$
14: else
15: Set (Xk+1 , ∆lk+1 ) = (Xks , γ1 ∆lk ), αk = min{γ1 αk , 1} and k = k + 1.
16: end if
17: end for
Algorithm 3 [Ikh , Xks , ∆lk , ∆hk ] = ASTRO-LFDF(Xk )
Input: Xk , ∆lk , model fitness thresholds 0 < η < 1 and certification threshold µ > 0, sufficient
reduction constant θ > 0, expansion and shrinkage constants γ1 > 1 and γ2 ∈ (0, 1), sample
size lower bound sequence {λk } = {O(log k)}, adaptive sampling constant κ > 0, correlation
constant αk > 0, correlation threshold αth > 0, lower bound of an initial variance approxima-
tion σ0l > 0, sufficient reduction constant ζ > 0, and gradient norm of the model lower bound
ϵ̂ > 0.
1: loop
2: if αk < αth then
3: Set Ikh = True and Xks,l = Xk
4: break
5: end if
6: Select $\mathcal{X}_k^l = \{X_k^i\}_{i=0}^{p} \subset B(X_k; \Delta_k^l)$.
7: Estimate $\bar{F}^l(X_k^i, T_k^i)$ at $\{X_k^i\}_{i=0}^{2d}$, satisfying (13) with $\Delta_k^l$ instead of $\Delta_k^h$.
8: Construct the local model $M_k^l(x)$.
9: Approximately compute the local model minimizer $X_k^{s,l}$.
10: Estimate $\tilde{F}_k(X_k^{s,l})$ and $\tilde{F}_k(X_k^0)$ by calling Algorithm 1 with $\Delta_k = \Delta_k^l$.
11: Compute the success ratio $\hat\rho_k$ as
increases. Additionally, ∆hk is adjusted to be larger than ∆lk at Step 18 in Algorithm 3, and we
proceed to the next iteration k + 1. If not, the candidate is rejected, leading to the contraction of
∆lk , a decrease in αk , and progression to Step 6 in Algorithm 3 to identify a superior candidate
within the shrunken TR. This process continues until the algorithm concludes that the LF oracle
cannot contribute to discovering a better solution, indicated by αk < αth .
Remark 3. In Algorithm 3, the sufficient reduction test (Steps 11 and 12) differs from the one in Algorithm 2. Firstly, for a successful iteration, the reduction in function estimates must be larger than $\zeta(\Delta_k^h)^2$ for some $\zeta > 0$ (see Step 11). Secondly, the norm of the local model gradient must be larger than $\hat\epsilon$ (see Step 12). These conditions prevent us from accepting a candidate due
to a very small reduction in the local model value in (14). Additionally, they ensure the almost
sure convergence of ASTRO-BFDF (See the proof of Theorem 2 in Section 4).
When Algorithm 3 fails to identify the next iterate, we construct the local model of the HF
function (Mkh ) in Algorithm 2. To select the design set Xk , we aim to reuse as many previously
visited design points from past iterations or those used while constructing Mkl in the current iter-
ation as possible. Subsequently, we estimate the value of the HF function by invoking Algorithm
1. This yields the HF function estimate $\tilde{F}(X_k^i)$ and the LF function estimate $\bar{F}^l(X_k^i, V_k^i)$ for any $i \in \{0, 1, 2, \dots, |\mathcal{X}_k|\}$. Then, we can additionally derive estimates for the LF function ($\bar{F}^l(X_k^i, T_k^i)$), aligning with the adaptive sampling rule (13). It is worth noting that when the estimates for the HF function are acquired through BFMC, $V_k^i$ from Algorithm 1 inherently adheres to (13), i.e., $V_k^i \geq T_k^i$, with high probability.
two distinct local models for the HF and LF functions are constructed, and two minimizers, Xks,l
and Xks,h , are derived, with one stemming from Mkl and the other from Mkh . We evaluate the HF
function values at two potential candidate points and designate the one with the lower objective
value as the candidate point to go forward with. Leveraging this candidate, we update the next
iterate and ∆hk . Finally, the adaptive correlation constant is adjusted based on the results of the
sufficient reduction test at Xks,l .
In Algorithm 2, the creation of the local model for the LF function occurs at various points,
each serving distinct purposes. Specifically, within Algorithm 3, which operates as the inner loop
within Algorithm 2, we construct this model to seek an improved solution for the HF function.
This decision stems from the belief that the LF function shares analogous gradient and curvature
information, deduced from the insights gathered from previous iterations, denoted by αk > αth .
Thus, the utilization of the HF oracle is minimized, being employed only for the sufficient reduction
test. In the outer loop, the primary objective is to update the adaptive correlation constant, even
in cases where the LF function has not proven beneficial in preceding iterations. In this case,
our aim is to minimize reliance on the LF function, a goal achievable through the adoption
of BFMC. When the HF function values are estimated at Step 5 in ASTRO-BFDF, an ample
number of independent replications of the LF oracle are already obtained by BFMC. This enables
the construction of Mkl without incurring any additional computational burden. Unfortunately,
in scenarios where the LF function fails to contribute meaningfully to the optimization process,
Algorithm 2 may consume more resources compared to alternative solvers that exclusively utilize
the HF function. However, discerning the utility of the LF function necessitates an additional
computational budget, albeit the impact may be marginal in practice, as we will elaborate on in
Section 5.
4 Convergence Analysis
In this section, we delve into demonstrating the convergence of ASTRO-BFDF. We first introduce
two additional assumptions concerning the local model. Firstly, we stipulate that the minimizer of
the local model must yield a certain degree of function reduction, known as the Cauchy reduction
(See Definition 3). Secondly, the Hessian of the local model should be uniformly bounded. Both
of these assumptions are essential to validate the quality of the candidate point for any given
iteration k.
Assumption 3 (Reduction in Subproblem). For some $\kappa_{fcd} \in (0, 1]$, $q \in \{h, l\}$, and all $k$, $M_k^q(X_k^0) - M_k^q(X_k^{s,q}) \geq \kappa_{fcd}\big(M_k^q(X_k^0) - M_k^q(X_k^0 + S_k^c)\big)$, where $S_k^c$ is the Cauchy step.
Assumption 4 (Bounded Hessian in Norm). In ASTRO-BFDF, the local model Hessians Hqk are
bounded by κqH for all k and q ∈ {h, l} with κqH ∈ (0, ∞) almost surely.
The convergence analysis of the adaptive sampling stochastic TR method for derivative-free op-
timization has received considerable attention in prior works such as [11, 15]. While our approach
to proving the convergence shares similarities with those, there are two crucial considerations we
must address.
(a) (Stochastic noise) BFMC ought to yield results consistent with those of Theorem 1, indicat-
ing that the stochastic error in BFMC will indeed be less than O((∆qk )2 ) for q ∈ {h, l} after
sufficiently large k. To achieve this, a crucial prerequisite (See Assumption 2) is ensuring
that F h (x, ξ) − cF l (x, ξ) exhibits similar properties to F h (x, ξ) for any ξ ∈ Ξ, c > 0, and
x ∈ ℜd .
(b) (Trust-region) The TR sizes for both HF and LF functions need to converge to zero. In the context of the convergence theory of most stochastic TR methods [2, 4, 11], it becomes imperative to demonstrate that the TR radius converges to zero as k approaches +∞ in
some probabilistic senses. This necessity arises because function estimate errors typically
remain bounded by the order of the TR size, given specific sampling rules and assumptions.
Consequently, the estimation errors will also converge to zero, ensuring the accuracy of the
estimates. Therefore, within bi-fidelity stochastic optimization, we also need the same result
for ∆hk . Furthermore, since ∆hk ≥ ∆lk for all k ∈ N, the convergence of ∆hk implies the
convergence of ∆lk as well.
Taking into account the aforementioned crucial considerations, we are now poised to present
the convergence theory of ASTRO-BFDF.
Theorem 2 (Almost Sure Convergence). Let Assumptions 1-4 hold. Then,
$$\lim_{k \to \infty} \|\nabla f^h(X_k)\| = 0 \quad \text{w.p.1}. \qquad (15)$$
Theorem 2 guarantees that a sequence {Xk (ω)} generated by Algorithm 2 converges to the
first-order stationary point for any sample path ω ∈ Ω.
Proof of Theorem 2. We start by demonstrating that the iid random variables $E_{k,j}^{i,h} - cE_{k,j}^{i,l}$ also fulfill Assumption 2 for any $k \in \mathbb{N}$, $c \in \mathbb{R}$, and $i \in \{0, 1, 2, \dots, p, s\}$, indicating their adherence to the sub-exponential distribution.
Lemma 3. Let Assumption 2 hold. Then there exist $\sigma^2 > 0$ and $b > 0$ such that for a fixed $n$ and $c \in \Re$,
$$\frac{1}{n}\sum_{j=1}^{n}\mathbb{E}\big[|E_{k,j}^{i,h} - cE_{k,j}^{i,l}|^m \mid \mathcal{F}_{k,j-1}\big] \leq \frac{m!}{2}b^{m-2}\sigma^2, \quad \forall m = 2, 3, \cdots, \ \forall k. \qquad (16)$$
Proof. We obtain from the Minkowski inequality and Assumption 2 that for any $k, j \in \mathbb{N}$, $c \in \mathbb{R}$, and any $m \in \{2, 3, \cdots\}$, there exist $b^h, b^l, (\sigma^h)^2, (\sigma^l)^2 > 0$ such that
$$\mathbb{E}\big[|E_{k,j}^{i,h} - cE_{k,j}^{i,l}|^m \mid \mathcal{F}_{k,j-1}\big] \leq \Big(\mathbb{E}\big[|E_{k,j}^{i,h}|^m \mid \mathcal{F}_{k,j-1}\big]^{\frac{1}{m}} + \mathbb{E}\big[c|E_{k,j}^{i,l}|^m \mid \mathcal{F}_{k,j-1}\big]^{\frac{1}{m}}\Big)^m \leq \Big(\big(\tfrac{m!}{2}(b^h)^{m-2}(\sigma^h)^2\big)^{\frac{1}{m}} + \big(\tfrac{m!}{2}(b^l)^{m-2}(\sigma^l)^2\big)^{\frac{1}{m}}\Big)^m. \qquad (17)$$
Without loss of generality, let us assume that $\sigma^h > \sigma^l > 0$ and $b^h > b^l$. Then there must exist some constants $\alpha_\sigma, \alpha_b > 1$ such that $\alpha_\sigma^2(\sigma^l)^2 = (\sigma^h)^2$ and $\alpha_b b^l = b^h$. Then the right-hand side of (17) becomes $\big((\alpha_\sigma^2\alpha_b^{m-2})^{1/m} + 1\big)^m\big(2^{-1}m!(b^l)^{m-2}(\sigma^l)^2\big)$. Since $(\alpha_\sigma^2\alpha_b^{m-2})^{1/m} + 1 \leq \alpha_\sigma\alpha_b + 1$ for all $m \in \{2, 3, \cdots\}$, we obtain
$$\frac{1}{n}\sum_{j=1}^{n}\mathbb{E}\big[|E_{k,j}^{i,h} - cE_{k,j}^{i,l}|^m \mid \mathcal{F}_{k,j-1}\big] \leq \frac{m!}{2}\big((\alpha_\sigma\alpha_b + 1)b^l\big)^{m-2}\big((\alpha_\sigma\alpha_b + 1)\sigma^l\big)^2.$$
Hence, the statement of the lemma holds with $\sigma = \sigma^l(\alpha_\sigma\alpha_b + 1)$ and $b = (\alpha_\sigma\alpha_b + 1)b^l$.
Now let us prove that the function estimate error from BAS is bounded by O((∆hk )2 ), aligning
with the outcome stated in Theorem 1. This finding not only enables us to attain the stochastic
fully linear model (See Definition 2) but also leads to the crucial observation that ∆hk converges
to 0 almost surely as k tends to infinity.
Lemma 4. Let Assumption 2 hold and let $X_k^i$ for $i \in \{0, 1, 2, \dots, p, s\}$ be the design points generated by Algorithm 2 at iteration $k$. Let $\tilde{F}(X_k^i) = f(X_k^i) + \tilde{E}_k^i$ be the HF function estimate obtained from Algorithm 1 with $\Delta_k = \Delta_k^q$ for $q \in \{h, l\}$. Then, given $c_f > 0$,
$$P\{|\tilde{E}_k^i| \geq c_f(\Delta_k^q)^2 \ \text{i.o.}\} = 0. \qquad (18)$$
Proof. Let ω ∈ Ω. Firstly, if the function estimate from BAS was obtained by CMC, we know
from Theorem 1 that the statement of the lemma is satisfied. Now, we assume that the function
estimate $\tilde{F}(X_k^i)$ is obtained using BFMC, implying that
$$|\tilde{E}_k^i(\omega)| = \big|\bar{E}_k^{i,h}(N_k^i(\omega)) - C_k(\omega)\bar{E}_k^{i,l}(N_k^i(\omega)) + C_k(\omega)\bar{E}_k^{i,l}(V_k^i(\omega))\big|,$$
where $N_k^i = N_k(X_k^i)$, $V_k^i = V_k(X_k^i)$, and $\bar{E}_k^{i,q}(N_k^i) = N(X_k^i)^{-1}\sum_{j=1}^{N(X_k^i)} E_{k,j}^{i,q}$ for $q \in \{h, l\}$. To simplify notation, we will omit $\omega$ from this point forward. Then we have
$$P\{|\tilde{E}_k^i| \geq c_f(\Delta_k^q)^2 \mid C_k = c\} \leq P\Big\{\big|\bar{E}_k^{i,h}(N_k^i) - c\bar{E}_k^{i,l}(N_k^i)\big| \geq \frac{c_f}{2}(\Delta_k^q)^2\Big\} + P\Big\{\big|c\bar{E}_k^{i,l}(V_k^i)\big| \geq \frac{c_f}{2}(\Delta_k^q)^2\Big\}. \qquad (19)$$
We know from Steps 1 and 8 in Algorithm 1, Lemma 2, and $N_k^i < V_k^i$ that we obtain $N_k^i$, $V_k^i$, and $c$ such that
$$\frac{\max\Big\{\sigma_0, \sqrt{(\hat\sigma^h(X_k^i, N_k^i))^2 + c^2(\hat\sigma^l(X_k^i, V_k^i))^2 - 2c\hat\sigma^{h,l}(X_k^i, N_k^i)}\Big\}}{\sqrt{N_k^i}} \leq \frac{\kappa(\Delta_k^q)^2}{\sqrt{\lambda_k}},$$
and
$$\frac{\max\Big\{\sigma_0, \sqrt{2c\hat\sigma^{h,l}(X_k^i, N_k^i) - c^2(\hat\sigma^l(X_k^i, V_k^i))^2}\Big\}}{\sqrt{V_k^i}} \leq \frac{\kappa(\Delta_k^q)^2}{\sqrt{\lambda_k}},$$
for some $\sigma_0 > 0$. We also know from Lemma 3 that Assumption 2 holds for $E_{k,0}^{i,h} - cE_{k,0}^{i,l}$, implying that Theorem 1 holds for $E_{k,0}^{i,h} - cE_{k,0}^{i,l}$. Hence, the right-hand side of (19) is summable, from which we obtain that $P\{|\tilde{E}_k^i| \geq c_f(\Delta_k^q)^2\}$ is also summable based on $P\{|\tilde{E}_k^i| \geq c_f(\Delta_k^q)^2\} = \mathbb{E}\big[P\{|\tilde{E}_k^i| \geq c_f(\Delta_k^q)^2 \mid C_k = c\}\big]$. As a result, the statement of the lemma holds.
Next, we demonstrate that as $k$ goes to infinity, both TR radii inevitably converge to zero almost surely. Although the main framework of our proof differs only trivially from the one presented in [4], we provide a comprehensive proof in Appendix A to facilitate understanding.
Relying on Lemma 5, we show through Lemma 6 that the gradient of the model for the HF
function converges to a true gradient almost surely. It is worth highlighting that the local model
for the HF function is not constructed at every iteration, as sometimes the local model for the LF
function can discover a better solution.
Lemma 6. Let Assumptions 1-4 hold. Let $\{k_j\}$ be the subsequence such that $I_{k_j}^h = \text{True}$. Then, $\|\nabla M_{k_j}^h(X_{k_j}^0) - \nabla f(X_{k_j}^0)\| \xrightarrow{\text{w.p.1}} 0$ as $j \to \infty$.
Proof. We know from Lemma 4 that given $c_f > 0$, there exists a sufficiently large $J$ such that $|\tilde{E}_{k_j}^i| < c_f(\Delta_{k_j}^h)^2$ for any $i \in \{0, 1, \cdots, p, s\}$ and $j > J$. Then from Lemma 1, we have
$$\begin{aligned} \|\nabla M_{k_j}^h(X_{k_j}^0) - \nabla f(X_{k_j}^0)\| &\leq \kappa_{eg1}\Delta_{k_j}^h + \kappa_{eg2}\frac{\sqrt{\sum_{i=1}^{p}(\tilde{E}_{k_j}^i - \tilde{E}_{k_j}^0)^2}}{\Delta_{k_j}^h} \\ &\leq \kappa_{eg1}\Delta_{k_j}^h + \kappa_{eg2}\frac{|\tilde{E}_{k_j}^i - \tilde{E}_{k_j}^0|}{\Delta_{k_j}^h} \\ &\leq (\kappa_{eg1} + 2\kappa_{eg2}c_f)\Delta_{k_j}^h. \end{aligned}$$
Given that Lemma 5 ensures $\Delta_{k_j}^h$ converges to 0 w.p.1, the statement of the lemma holds.
In the following lemma, we demonstrate that after a sufficient number of iterations, if the
TR for the HF function is relatively smaller than the model gradient, the iteration is successful
with probability one. Given Lemma 6, Lemma 7 suggests that in cases where the true gradient
is greater than zero, if the TR radius for the HF function is comparatively smaller than the true
gradient, the candidate solution is accepted and the TR is expanded. This ensures that the TR
for the HF function will not converge to zero before the true gradient does.
Lemma 7. Let Assumptions 1-4 hold. Then there exists $c_d > 0$ such that
$$P\Big\{\big\{\Delta_k^h \leq c_d\|\nabla M_k^h(X_k^0)\|\big\} \cap \{\hat\rho_k < \eta\} \ \text{i.o.}\Big\} = 0.$$
Proof. We first note that for any k ∈ N, when the minimizer of the low fidelity local model in
Algorithm 3 is accepted as a next iterate, i.e., Ikh is False, we already have ρ̂k ≥ η. Otherwise, the
HF local model is constructed in Algorithm 2. Then the rest of the proof trivially follows from
Lemma 4.4 with the adaptive sampling rule (A-0) in [15].
Lemma 8. Let Assumptions 1-4 hold. Then
$$\liminf_{k \to \infty} \|\nabla f^h(X_k)\| = 0 \quad \text{w.p.1}. \qquad (20)$$
Proof. Using Lemmas 6 and 7, the proof can be completed by straightforwardly following the steps outlined in Theorem 4.6 of [4].
We have now reached a point where we can confidently establish the almost sure convergence
of ASTRO-BFDF. The following proof solidifies our claim.
Proof of Theorem 2. For the sake of contradiction, we first assume that there is a subsequence of iterates whose gradients are bounded away from zero. Particularly, suppose that there exist a set $\hat{D}$ of positive measure, $\omega_1 \in \hat{D}$, $\epsilon_0 > 0$, and a subsequence of successful iterates $\{t_j(\omega_1)\}$ such that $\|\nabla f^h(X_{t_j(\omega_1)}(\omega_1))\| > 2\epsilon_0$ for all $j \in \mathbb{N}$. We denote $t_j = t_j(\omega_1)$ and suppress $\omega_1$ in the following statements for ease of notation. Due to the lim-inf type of convergence just proved in (20), for each $t_j$ there exists a first successful iteration $\ell_j := \ell(t_j) > t_j$ such that, for large enough $k$,

and
$$\|\nabla f^h(X_{\ell_j})\| < 1.5\epsilon_0. \qquad (22)$$
Define $\mathcal{A}_j^h := \{k \in \mathcal{H} : t_j \leq k < \ell_j\}$ and $\mathcal{A}_j^l := \{k \in \mathcal{L} : t_j \leq k < \ell_j\}$. Let $j$ be sufficiently large
Since k is a successful iteration, ρ̂k ≥ η. Furthermore, Assumption 3 and (23) then imply that
$$\begin{aligned} f^h(X_k) - f^h(X_{k+1}) + \tilde{E}_k^0 - \tilde{E}_k^s &\geq \eta\big[M_k^h(X_k) - M_k^h(X_{k+1})\big] \\ &\geq \frac{1}{2}\eta\kappa_{fcd}\|\nabla M_k^h(X_k)\|\min\left\{\frac{\|\nabla M_k^h(X_k)\|}{\|\mathsf{H}_k\|}, \Delta_k^h\right\} \qquad (24) \\ &> c_{fd}\Delta_k^h, \end{aligned}$$
where $c_{fd} = \frac{1}{2}\eta\kappa_{fcd}\min\{\epsilon_0, \hat\epsilon\}$. When $k \in \mathcal{A}_j^l$, we also obtain a result similar to (24) using $\|\nabla M_k^l(X_k)\| > \hat\epsilon$:
$$f^h(X_k) - f^h(X_{k+1}) + \tilde{E}_k^0 - \tilde{E}_k^s > c_{fd}\Delta_k^l. \qquad (25)$$
Since we know from Lemma 4 that
$$|\tilde{E}_k^0 - \tilde{E}_k^s| < 0.5c_{fd}\Delta_k^h \ \text{for } k \in \mathcal{A}_j^h \quad \text{and} \quad |\tilde{E}_k^0 - \tilde{E}_k^s| < 0.5c_{fd}\Delta_k^l \ \text{for } k \in \mathcal{A}_j^l, \qquad (26)$$
the sequence $\{f^h(X_k)\}_{k\in\mathcal{A}_j}$ is monotone decreasing for sufficiently large $j$. From (25), (26), and the fact that $\|X_k - X_{k+1}\| \leq \Delta_k$ for all $k$, we deduce that
$$\|X_{t_j} - X_{\ell_j}\| \leq \sum_{i\in\mathcal{A}_j}\|X_i - X_{i+1}\| \leq \sum_{i\in\mathcal{A}_j^h}\Delta_i^h + \sum_{i\in\mathcal{A}_j^l}\Delta_i^l \leq \frac{2\big(f^h(X_{t_j}) - f^h(X_{\ell_j})\big)}{c_{fd}}. \qquad (27)$$
Now define $\mathcal{C}_j := \{k \in \mathcal{K} : \ell_j \leq k < t_{j+1}\}$. Let $k \in \mathcal{C}_j$ for sufficiently large $j$. From (24)-(26), we then obtain
$$f^h(X_k) - f^h(X_{k+1}) \geq 0.5c_{fd}(\Delta_k^l)^2,$$
implying that the sequence $\{f^h(X_k)\}_{k\in\mathcal{A}_j\cup\mathcal{C}_j}$ is monotone decreasing for sufficiently large $j$. The boundedness of $f^h$ from below then implies that the right-hand side of (27) converges to 0 as $j$ goes to infinity, concluding that $\lim_{j\to\infty}\|X_{t_j} - X_{\ell_j}\| = 0$. Consequently, by continuity of the gradient, we obtain that $\lim_{j\to\infty}\|\nabla f^h(X_{t_j}) - \nabla f^h(X_{\ell_j})\| = 0$. However, this contradicts $\|\nabla f^h(X_{t_j}) - \nabla f^h(X_{\ell_j})\| > 0.5\epsilon_0$, obtained from (21) and (22). Thus, (15) must hold.
5 Numerical Experiments
We will now assess and compare ASTRO-BFDF with other simulation optimization solvers. Our
focus lies on testing across two distinct problem categories: synthetic problems and toy problems
with Discrete Event Simulation (DES).
Synthetic problems constitute deterministic problems embellished with artificial stochastic
Gaussian noise. Given our knowledge of the closed equation of f h , generating numerous problems
that adhere to predetermined assumptions becomes relatively straightforward. However, since
both the function f h and the stochastic noises are artificially generated, the performance of the
solvers on these problems might not be indicative of its efficacy in handling real-world problems.
In particular, when the same random number stream is used, the stochastic noises at different
design points will be identical, implying that F h (·, ξ) − f h (·) is a constant function given fixed
ξ ∈ Ξ. This setting satisfies a stricter assumption than the one posed in this paper. Hence, we
also evaluated the solvers on toy problems utilizing DES to ensure testing with more realistic
scenarios. DES simulates real-world conditions, generating multiple outputs utilized within the
objective function f . All experiments have been implemented using SimOpt [18].
We compare ASTRO-BFDF, ASTRO-DF, ADAM [19], and Nelder-Mead [20]. In implement-
ing the solvers, including ASTRO-BFDF, we applied Common Random Numbers (CRN), which
involves using the same random number stream to reduce variance when comparing function esti-
mates at different design points. To integrate CRN into ASTRO-BFDF, each time a local model
is constructed, the sample sizes and the coefficient at the center point, obtained through BAS, are
preserved and subsequently utilized for estimating the function values at other design points.
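The sketch below illustrates the CRN idea itself; the toy simulator and the seeding scheme are our illustrative assumptions, not SimOpt's API.

```python
# A minimal sketch of common random numbers: the same seed drives the noise
# at both design points, so the difference of estimates is far less noisy.
import numpy as np

def estimate(sim, x, n, seed):
    rng = np.random.default_rng(seed)   # fixed stream => common random numbers
    return np.mean([sim(x, rng) for _ in range(n)])

sim = lambda x, rng: (x - 1.0)**2 + rng.normal()   # hypothetical noisy oracle
d_crn = estimate(sim, 0.5, 100, seed=7) - estimate(sim, 0.6, 100, seed=7)
d_ind = estimate(sim, 0.5, 100, seed=7) - estimate(sim, 0.6, 100, seed=8)
print(abs(d_crn - 0.09), abs(d_ind - 0.09))  # CRN error is typically smaller
```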
which the stochastic noises for the HF and LF oracles adhere to the Gaussian distribution $N(0, c_{sd}^h)$ for $c_{sd}^h \in \{20, 30, 40\}$ and the Gaussian distribution $N(0, c_{sd}^l)$ for $c_{sd}^l \in \{20, 30, 40\}$, respectively.
Figure 2: Fraction of 108 synthetic problems solved to 0.01-optimality with 95% confidence
intervals from 20 runs of each algorithm shows a clear advantage in finite-time performance of
ASTRO-BFDF.
Solvability profiles are used to compare the solvers, as illustrated in Figure 2. These profiles
provide insights into how well a solver performs by showing the proportion of tested problems it
solves within a certain relative optimality gap. Calculating this gap requires the optimal solution,
which is determined as the best solution among all solvers for a given problem in practice. When
the cost ratio of calling HF and LF oracles stands at 1:0.1, ASTRO-BFDF emerges as a standout
performer, solving over 50% of the problems within a mere 15% of the budget. What is particularly
noteworthy is that even when the costs for the LF function match those for the HF function (cost
ratio 1:1), ASTRO-BFDF demonstrates faster convergence than ASTRO-DF. This suggests that
utilizing the LF function could be beneficial for optimization, even if optimizing it requires a larger
computational budget compared to optimizing the HF function. Hence, we will next delve deeper
into the specific scenarios where leveraging the LF function proves advantageous for optimization.
Usefulness of the LF function Most papers [16, 17] employ the correlation between the LF
and HF functions or the LF and the surrogate model to determine whether employing the LF
oracle can be helpful for the optimization or not. However, the correlation can be varied based
on the region we try to quantify. For instance, even though the LF function may exhibit a high
correlation with the HF function in specific feasible regions, its usefulness can vary based on the
optimization progress and setup, such as the initial design point. Hence, the correlation might not be a suitable metric to determine whether the LF function is helpful for the optimization. Indeed, instead of requiring high correlation within the entire feasible region, having an accurate gradient at the current iterate is sufficient to find a better solution, utilizing any available information
source. This rationale underpins the use of an adaptive correlation constant in ASTRO-BFDF,
which is updated based on whether the previous gradient estimates from the LF oracle have
improved the solution.
(a) Low correlation (κcor = 0.1) (b) High correlation (κcor = 0.9)
Figure 3: Solvability profiles of 36 problems with 95% confidence intervals from 20 runs of each
algorithm with two different correlation setting between the LF and HF functions.
Figure 4: The illustration depicts the scenario where the LF function is convex. Xh∗ and Xl∗
represent the global optima of the HF and LF functions, respectively, while X0 marks the initial
iterate. Depending on the step size, using only the HF function may lead Xk to converge to a
local optimum (X̂h∗ ). However, if the LF function is used until the iterate reaches the green area,
achieving the global optimum becomes possible.
The usefulness of the LF function in providing accurate gradient estimates can be maximized
when it possesses unique structural properties, such as convexity, which could aid in locating
the global optimum of the non-convex HF function (See Figure 4). In this case, the bi-fidelity
optimization still remains advantageous despite high variance and costs of the LF oracle. However,
it is important to note that the opposite scenario is also possible, where the optimum of the LF
function is located near the local optimum of the HF function, which is undesirable (See Figure
1).
Figure 5: The contour maps of the HF and LF function without stochastic noises of the BRANIN
problem with κcor = 1.
The Branin function is an example for which the bi-fidelity optimization is helpful due to the
structure of the LF function. In Figure 5, even though the LF function is non-convex, it possesses
a favorable structure that allows gradient-based methods to find the global optimum more easily
compared to the HF function. Hence, during the optimization, the solver can find the solution
near the global optimum of the HF function by leveraging only the LF function.
Remark 4. The trajectory of ASTRO-BFDF toward a local optimum closely hinges on two critical
factors: the initial design point and the parameter $\alpha_0$. Take, for instance, a scenario such as the BRANIN problem, where the LF function points to a neighborhood of the global optimum of the HF function. Yet, if we start ASTRO-BFDF from an initial solution like (7,8), it is quite likely that ASTRO-BFDF converges towards nearby points, perhaps around (4,10) or (9,10). Moreover,
when α0 < αth , the optimization at iteration 0 leans on the HF function, potentially leading the
iterates to converge to a distinct local optimum compared to the path followed when α0 > αth . In
our numerical experiments, we deliberately set α0 > αth to maximize the computational efficiency.
HF model’s behavior over a shorter duration, say 110 days. In this specific instance, the cost ratio
between the HF and LF models stands at 1 : 0.3. A notable aspect of this problem is that running
one replication of the HF model inherently produces one replication of the LF model without
incurring any additional computational expenses.
Figure 6: Fraction of 25 toy problems with DES solved to 0.1-optimality with 95% confidence
intervals from 20 runs of each algorithm. ASTRO-BFDF demonstrates not only a faster con-
vergence but also an enhanced ability to identify superior solutions by the end of the allocated
computational budget.
Before delving into the details of each problem, we provide the solvability profile with 25
instances (See Figure 6), specifically including 5 instances from MM1 and 20 instances from
SSCONT. The cost ratio between the HF and LF models for both problems is 1 : 0.3, indicating
that the LF oracle simulates the system for 0.3T days.
(a) Independent sampling and λ = 1    (b) CRN and λ = 1
(c) Independent sampling and λ = 5    (d) CRN and λ = 5
Figure 7: The trajectory of the objective function of the M/M/1 problem with and without
CRN. When employing CRN, the objective function exhibits smoothness, indicating that both
E h (·, ξ) and E l (·, ξ) are smooth functions for any ξ ∈ Ξ.
We have conducted testing on 5 instances of the M/M/1 problem, varying λ across the range
{1, 2, . . . , 5} in Figure 6. Figure 8 illustrates the optimization progress for two scenarios: one
where λ = 1 and another where λ = 5. In the scenario where λ = 1, as the incumbents approach
the optimal solution, it becomes essential for the TR to contract appropriately in order to achieve
an accurate gradient approximation. While contracting the TR, ASTRO-DF exhausts its budget
entirely, which explains its slower convergence in Figure 8a. In contrast, ASTRO-BFDF is capable
of rapidly identifying a near-optimal solution. The primary reason is that the gradient of the local
model for the LF function is inherently small, enabling us to sustain successful iterations before the
TR initiates sequential contraction. Conversely, when λ = 5, the gradient of the local model for the
LF function becomes exceedingly minuscule, prompting a cessation of LF function utilization after
just a few iterations. Hence, in Figure 8b, the optimization trajectory of ASTRO-BFDF appears
similar to that of ASTRO-DF, but ASTRO-BFDF demonstrates slightly faster convergence due
to the variance-reduced function estimates provided by BAS.
(a) λ = 1 (b) λ = 5
Figure 8: Fraction of the optimality gap with 95% confidence intervals from 20 runs of each
algorithm.
In the inventory problem, we consider the (s, S) inventory model with full backlogging. At each time step t, the demand Dt, which follows the exponential distribution with mean µD, is generated. At the end of each time step, the inventory level is calculated and, if it is below s, an order to get back up to S is placed. Lead times follow the Poisson distribution with mean µL time steps. The optimization goal is to find the best s and S for minimizing the average cost, which is composed of backorder costs, order costs, and holding costs. This problem is significantly more challenging than
the MM1 problem due to the inherent non-smoothness with CRN (see Figure 9). Therefore, it is
highly probable that the majority of incumbent sequences converges to local optima, regardless
of the solvers used.
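For concreteness, the following is a toy sketch of such an (s, S) simulation; it is our simplification rather than the SimOpt SSCONT model, and the cost coefficients and parameter values are illustrative assumptions.

```python
# A toy (s, S) inventory DES with full backlogging: exponential demand,
# Poisson lead times, per-period holding/backorder/order costs over T periods.
import numpy as np

def sscont_cost(s, S, T, mu_d, mu_l, rng,
                c_hold=1.0, c_back=4.0, c_order=32.0, c_unit=0.0):
    inv = float(S)               # on-hand inventory (negative means backlog)
    pipeline = []                # (arrival_period, quantity) of open orders
    total = 0.0
    for t in range(T):
        inv += sum(q for (a, q) in pipeline if a == t)      # receive arrivals
        pipeline = [(a, q) for (a, q) in pipeline if a > t]
        inv -= rng.exponential(mu_d)                        # satisfy demand
        position = inv + sum(q for (_, q) in pipeline)
        if position < s:                                    # order up to S
            q = S - position
            pipeline.append((t + 1 + rng.poisson(mu_l), q))
            total += c_order + c_unit * q
        total += c_hold * max(inv, 0.0) + c_back * max(-inv, 0.0)
    return total / T

rng = np.random.default_rng(0)
# HF: T=100 days; LF: T=30 days, i.e., a cost ratio of roughly 1:0.3
hf = sscont_cost(300, 600, 100, mu_d=25, mu_l=3, rng=rng)
lf = sscont_cost(300, 600, 30, mu_d=25, mu_l=3, rng=rng)
```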
We conducted tests on 20 instances of the SSCONT problem, with varied parameters µD =
{25, 50, 100, 200, 400} and µL = {1, 3, 6, 9}. In the majority of cases, ASTRO-BFDF demonstrated
at least similar performance to ASTRO-DF, and sometimes surpassed it by uncovering superior
solutions with a smaller budget. This success can be attributed to its ability to avoid getting
trapped in local optima too quickly by leveraging the LF function. Detailed numerical results can
be found in Appendix B.
6 Conclusion
This paper introduces ASTRO-BFDF, a novel stochastic TR algorithm tailored for addressing bi-
fidelity simulation optimization. ASTRO-BFDF stands out for two key features: Firstly, it utilizes
bi-fidelity Monte Carlo or crude Monte Carlo dynamically, adjusting sample sizes adaptively for
both fidelity oracles within BAS. This ensures accurate estimation of function values, with the
accuracy required for both function and gradient determined by the progress of optimization.
Secondly, it strategically guides incumbents towards the neighborhood of the stationary point
of the HF function by solely utilizing the LF function. These two features allow the algorithm to achieve faster convergence with enhanced computational efficiency, as demonstrated on several problems
Figure 9: The contour maps of the HF and LF function of the SSCONT problem with CRN.
The HF simulator operates for 100 days, while the LF simulator runs for 30 days, indicating a
cost ratio of 1:0.3.
including the synthetic problems and toy problems with DES. We also demonstrate the asymptotic behavior of the incumbents generated by ASTRO-BFDF, which converge to the stationary point almost surely.
Acknowledgments
This work was authored by the National Renewable Energy Laboratory, operated by Alliance
for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No.
DE-AC36-08GO28308. Funding for the algorithmic development and numerical experiment work
was provided by Laboratory Directed Research and Development investments. Funding for the
theoretical work (proofs) was provided by the Office of Science, Office of Advanced Scientific Com-
puting Research, Scientific Discovery through Advanced Computing (SciDAC) program through
the FASTMath Institute. The views expressed in the article do not necessarily represent the views
of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accept-
ing the article for publication, acknowledges that the U.S. Government retains a nonexclusive,
paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work,
or allow others to do so, for U.S. Government purposes.
A Proof of Lemma 5
Proof. Let us begin by noting that we have established from Step 13 in Algorithm 2 and Step 18
in Algorithm 3 that ∆hk ≥ ∆lk almost surely for any k ∈ N. Hence, if ∆hk converges to zero almost
surely, so does ∆lk . Let us define the following index sets,
We have, for any $k \in \mathcal{H}$,
$$\sum_{k\in\mathcal{K}}\kappa_R(\Delta_k^h)^2 \leq \sum_{k\in\mathcal{K}}\big(f^h(X_k) - f^h(X_{k+1}) + \tilde{E}_k^0 - \tilde{E}_k^s\big) \leq f^h(x_0) - f_*^h + \sum_{k=0}^{\infty}|\tilde{E}_k^0 - \tilde{E}_k^s|,$$
where $f_*^h$ is the optimal value of $f^h$. We note that $\mathcal{H}$ and $\mathcal{L}$ are disjoint sets and for any $k \notin \mathcal{K}$, $\Delta_{k+1}^h = \gamma_2\Delta_k^h$. Let $\mathcal{K} = \{k_1, k_2, \dots\}$, $k_0 = -1$, and $\Delta_{-1}^h = \Delta_0^h/\gamma_2$. Then from the fact that $\Delta_k^h \leq \gamma_1\gamma_2^{k-k_i-1}\Delta_{k_i}^h$ for $k = k_i + 1, \dots, k_{i+1}$ and each $i$, we obtain
$$\sum_{k=k_i+1}^{k_{i+1}}(\Delta_k^h)^2 \leq \gamma_1^2(\Delta_{k_i}^h)^2\sum_{k=k_i+1}^{k_{i+1}}\gamma_2^{2(k-k_i-1)} \leq \gamma_1^2(\Delta_{k_i}^h)^2\sum_{k=0}^{\infty}\gamma_2^{2k} = \frac{\gamma_1^2}{1-\gamma_2^2}(\Delta_{k_i}^h)^2.$$
By Lemma 4, there must exist a sufficiently large $K_\Delta$ such that $|\tilde{E}_k^0 - \tilde{E}_k^s| < c_\Delta(\Delta_k^h)^2$ for any given $c_\Delta > 0$ and any $k \geq K_\Delta$. Then, we have
$$\sum_{k=0}^{\infty}(\Delta_k^h)^2 \leq \frac{\gamma_1^2}{1-\gamma_2^2}\sum_{i=0}^{\infty}(\Delta_{k_i}^h)^2 < \frac{\gamma_1^2}{1-\gamma_2^2}\left(\frac{(\Delta_0^h)^2}{\gamma_2^2} + \frac{f^h(x_0) - f_*^h + E'_{0,\infty}}{\kappa_R}\right) < \frac{\gamma_1^2}{1-\gamma_2^2}\left(\frac{(\Delta_0^h)^2}{\gamma_2^2} + \frac{f^h(x_0) - f_*^h + E'_{0,K_\Delta-1} + E'_{K_\Delta,\infty}}{\kappa_R}\right),$$
where $E'_{i,j} = \sum_{k=i}^{j}|\tilde{E}_k^0 - \tilde{E}_k^s|$. Then we get from $E'_{K_\Delta,\infty} < c_\Delta\sum_{k=K_\Delta}^{\infty}(\Delta_k^h)^2$ that
$$\sum_{k=K_\Delta}^{\infty}(\Delta_k^h)^2 < \frac{\gamma_1^2}{1-\gamma_2^2}\left(\frac{(\Delta_0^h)^2}{\gamma_2^2} + \frac{f^h(x_0) - f_*^h + E'_{0,K_\Delta-1}}{\kappa_R}\right)\left(1 - \frac{\gamma_1^2 c_\Delta}{(1-\gamma_2^2)\kappa_R}\right)^{-1}.$$
Therefore, $\Delta_k^h \xrightarrow{\text{w.p.1}} 0$ as $k \to \infty$ and the statement of the lemma holds.
B Numerical Results (SSCONT)
In this section, we show the performance of ASTRO-DF and ASTRO-BFDF on 20 instances of the SSCONT problem, where µ is the mean demand parameter and θ is the mean lead-time parameter. See Figure 10.
(e) µ = 50 and θ = 1    (f) µ = 50 and θ = 3
(k) µ = 100 and θ = 6    (l) µ = 100 and θ = 9
(q) µ = 400 and θ = 1    (r) µ = 400 and θ = 3
Figure 10: Optimization progress with 95% confidence intervals from 10 runs of ASTRO-DF and ASTRO-BFDF on SSCONT.
C Implementation Details
All methods used the same parameters (e.g., TR radius ∆k, success ratio η) where possible. ADAM and Nelder-Mead used the default settings outlined in the SimOpt GitHub repository [14]. In terms of
the design set selection for the model construction, ASTRO-DF has used 2d + 1 design points with
the rotated coordinate basis (See history-informed ASTRO-DF [22]). In the bi-fidelity scenario,
we have employed two distinct design sets (Xk and Xkl ) at Step 4 in Algorithm 2 and Step 6 in
Algorithm 3 respectively. Xk is selected to construct the local model for the HF function, implying
that the computational costs for estimating the function value at Xk is relatively high. Hence,
the design set will be selected by reusing the design points within the TR and the corresponding
replications as much as possible. To achieve this, we first pick d + 1 design points to obtain
sufficiently affinely independent points by employing Algorithm 4.2 in [23]. After that, we pick
additional d design points following the opposite direction to construct the quadratic interpolation
model with a diagonal Hessian. $\mathcal{X}_k^l$ consists of 2d + 1 design points, selected using the coordinate basis to minimize deterministic error, owing to the lower cost of the LF oracle. In this scenario, the design set $\mathcal{X}_k^l$ is optimal among design sets of any size ranging from d + 2 to 2d + 1 (see [24]).
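The coordinate-basis design set just described is simple to generate; a sketch (our helper names) follows.

```python
# A sketch of the 2d+1 coordinate-basis design set used for the LF model:
# the TR center plus +/- delta along each coordinate axis.
import numpy as np

def coordinate_design(center, delta):
    """Return the 2d+1 points: center and center +/- delta * e_i for each i."""
    center = np.asarray(center, dtype=float)
    d = center.size
    pts = [center]
    for i in range(d):
        e = np.zeros(d)
        e[i] = delta
        pts += [center + e, center - e]
    return np.stack(pts)    # shape (2d+1, d), all within B(center; delta)
```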
Hyper-parameters                         ASTRO-BFDF
∆max (maximum TR radius)                 problem dependent
∆l0 and ∆h0 (initial TR radius)          10^⌈log(∆max)−1⌉/d²
γ1 (expansion constant)                  1.5
γ2 (shrinkage constant)                  0.75
λk (sample size lower bound)             5
κ (adaptive sampling constant)           F(X0)/(∆h0)²
η (model fitness threshold)              0.1
α0 (initial correlation constant)        0.5
ϵ̂ (model gradient threshold)             0.001
ζ (sufficient reduction constant)        0.01
References
[1] A. S. Berahas, L. Cao, K. Choromanski, and K. Scheinberg, “A theoretical and empirical comparison of
gradient approximations in derivative-free optimization,” Foundations of Computational Mathematics,
vol. 22, no. 2, pp. 507–560, 2022.
[2] R. Chen, M. Menickelly, and K. Scheinberg, “Stochastic optimization using a trust-region method and
random models,” Mathematical Programming, vol. 169, no. 2, pp. 447–487, 2018.
[3] K.-H. Chang, L. J. Hong, and H. Wan, “Stochastic trust-region response-surface method (strong)—a
new response-surface framework for simulation optimization,” INFORMS Journal on Computing,
vol. 25, no. 2, pp. 230–243, 2013.
[4] Y. Ha and S. Shashaani, “Iteration complexity and finite-time efficiency of adaptive sampling trust-
region methods for stochastic derivative-free optimization,” arXiv:2305.10650, 2023.
[5] S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic pro-
gramming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[6] L. W. Ng and K. E. Willcox, “Multifidelity approaches for optimization under uncertainty,” Interna-
tional Journal for numerical methods in Engineering, vol. 100, no. 10, pp. 746–772, 2014.
[7] B. Peherstorfer, K. Willcox, and M. Gunzburger, “Survey of multifidelity methods in uncertainty
propagation, inference, and optimization,” Siam Review, vol. 60, no. 3, pp. 550–591, 2018.
[8] J. Xu, S. Zhang, E. Huang, C.-H. Chen, L. H. Lee, and N. Celik, “Efficient multi-fidelity simulation optimization,” in Proceedings of the Winter Simulation Conference 2014, pp. 3940–3951, IEEE, 2014.
[9] S. De, K. Maute, and A. Doostan, “Bi-fidelity stochastic gradient descent for structural optimization
under uncertainty,” Computational Mechanics, vol. 66, pp. 745–771, 2020.
[10] R. Bollapragada, R. Byrd, and J. Nocedal, “Adaptive sampling strategies for stochastic optimization,”
SIAM Journal on Optimization, vol. 28, no. 4, pp. 3312–3343, 2018.
[11] S. Shashaani, F. S. Hashemi, and R. Pasupathy, “ASTRO-DF: A class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization,” SIAM Journal on Optimization, vol. 28, no. 4, pp. 3145–3176, 2018.
[12] R. Bollapragada, C. Karamanli, and S. M. Wild, “Derivative-free optimization via adaptive sampling
strategies,” arXiv preprint arXiv:2404.11893, 2024.
[13] B. Peherstorfer, K. Willcox, and M. Gunzburger, “Optimal model management for multifidelity monte
carlo estimation,” SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. A3163–A3194, 2016.
[14] D. J. Eckman, S. G. Henderson, S. Shashaani, and R. Pasupathy, “SimOpt.” https://github.com/simopt-admin/simopt, 2023.
[15] Y. Ha, S. Shashaani, and R. Pasupathy, “Complexity of zeroth-and first-order stochastic trust-region
algorithms,” arXiv preprint arXiv:2405.20116, 2024.
[16] J. Müller, “An algorithmic framework for the optimization of computationally expensive bi-fidelity
black-box problems,” INFOR: Information Systems and Operational Research, vol. 58, no. 2, pp. 264–
289, 2020.
[17] X. Song, L. Lv, W. Sun, and J. Zhang, “A radial basis function-based multi-fidelity surrogate model:
exploring correlation between high-fidelity and low-fidelity models,” Structural and Multidisciplinary
Optimization, vol. 60, pp. 965–981, 2019.
[18] D. J. Eckman, S. G. Henderson, and S. Shashaani, “Diagnostic tools for evaluating and comparing
simulation-optimization algorithms,” INFORMS Journal on Computing, vol. 35, no. 2, pp. 350–367,
2023.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2017.
[20] R. R. Barton and J. S. Ivey Jr, “Nelder-mead simplex modifications for simulation optimization,”
Management Science, vol. 42, no. 7, pp. 954–973, 1996.
[21] L. Mainini, A. Serani, M. P. Rumpfkeil, E. Minisci, D. Quagliarella, H. Pehlivan, S. Yildiz, S. Ficini,
R. Pellegrini, F. Di Fiore, et al., “Analytical benchmark problems for multifidelity optimization meth-
ods,” arXiv preprint arXiv:2204.07867, 2022.
[22] Y. Ha and S. Shashaani, “Towards greener stochastic derivative-free optimization with trust regions
and adaptive sampling,” in 2023 Winter Simulation Conference (WSC), pp. 3508–3519, IEEE, 2023.
[23] S. M. Wild, R. G. Regis, and C. A. Shoemaker, “Orbit: Optimization by radial basis function in-
terpolation in trust-regions,” SIAM Journal on Scientific Computing, vol. 30, no. 6, pp. 3197–3219,
2008.
[24] T. M. Ragonneau and Z. Zhang, “An optimal interpolation set for model-based derivative-free opti-
mization methods,” arXiv preprint arXiv:2302.09992, 2023.