Multiobjective Optimization Based On Expensive Robotic Experiments Under Heteroscedastic Noise
One promising strategy for expensive MOO is the response surface method (RSM). In this type of algorithm, surrogate functions are constructed to fit the samples. These surrogates are then used, in place of the unknown true objective functions, to plan efficient experiments by balancing exploration and exploitation. In [9], RSM is used to design the path of a mobile robot to monitor environments intelligently by making use of noisy samples, and to this end, an extension of the upper confidence bound is proposed. A detailed explanation of RSM for single-objective optimization can be found in [10]. The efficient global optimization algorithm [11], an RSM-based single-objective optimization method, is extended to MOO based on the aggregation method in [12] and [13]. Emmerich et al. [14] suggested using response surfaces to assist EAs, and proposed the expected improvement in hypervolume (EIHV) as the ranking criterion. In [15], a similar approach to [14] is applied, but with the lower confidence bound of the improvement. In [16], the input to be evaluated in the next step is planned using different metrics, including an approximate EIHV, with four or five points selected in each step. In [17], the authors proposed using the expected maximin fitness improvement, whose analytical form was also given for the two-input case. In [18], another statistical measure, based on the theory of random closed sets, is proposed.

In terms of dealing with noise, most of the existing MOO methods have been evaluated only on noiseless observations, as in [12]. In the EA community, Teich [19] introduced the concept of probability of dominance into a well-known EA-based MOO method, the strength Pareto evolutionary algorithm (SPEA) [20], to make it robust to noise. Büche et al. [21] also extended SPEA to be robust to noise and outliers, and applied it to optimizing the combustion process of a gas turbine. However, these methods cannot be used for expensive optimization, as they require a considerable number of samples. Eskandari and Geiger [22] proposed the stochastic Pareto genetic algorithm, an extension of FastPGA [23], an EA for expensive MOO. However, their method also depends on empirical means and variances that require multiple evaluations for each input, which is not suitable for the optimization of expensive objectives. Fieldsend and Everson [24] proposed the rolling tide EA, which can handle noise that varies in time or space; however, this method requires too many evaluations to be used in robotic experiments.

In the single-objective optimization of robots, the problem of noise has been addressed using RSM with Gaussian process (GP) regression [25], modeling the uncertainty induced by observation noise [26]–[28]. Zuluaga et al. [29] proposed using GP regression and the response surface to determine whether a point is Pareto optimal. Independently, Tesch et al. [3] proposed using EIHV [14], [30] calculated from the results of GP regression. Noiseless numerical examples exhibited the superiority of EIHV over the aggregation function-based method proposed in [12]. However, the performance remained insufficient for noisy robotic experiments using a snake robot, and the authors had to average five runs per input to get a reliable result. One possible reason, aside from the overly large noise variance, is that homoscedastic noise (i.e., input-independent noise) was assumed in their work. As is verified later in this paper, the noise in robotic experiments is often not homoscedastic, and neglecting this sometimes results in poor function estimation. If the properties of the noise are known a priori, this knowledge may be coded into the kernel function of the GP regression model. However, since our target is the optimization of expensive functions, we cannot expect accurate prior knowledge. Therefore, we need a more flexible framework. Other possible factors include the existence of non-Gaussian noise and differences in the rate of change of the true objective functions (nonstationarity). There is some literature [31], [32] that deals with the nonstationarity problem; however, it is out of the scope of this paper.

To take input-dependent noise into account, heteroscedastic GP regression should be used in RSM. However, this kind of regression presents a difficult challenge because there is no analytical solution, and it has been discussed extensively in the machine learning community [33], [34]. Among the available methods, we chose variational heteroscedastic Gaussian process (VHGP) regression [35], as it gives a reasonable result with relatively little computation. VHGP regression has already been used for RSM in the context of the single-objective optimization of a robot [36], but to our knowledge, we are the first to use it in MOO. In [32], treed GP regression is used to make the model more flexible. Although this method would be applicable to problems with heteroscedastic noise at relatively small calculation cost, some information would be lost by partitioning the search space into subregions and training an independent GP regressor on each of them. Therefore, when the number of experiments is strictly restricted, it is better to do without partitioning the search space. Also, note that there are other ways to deal with noise that is not modeled as homoscedastic Gaussian noise. In [37], a hyperprior on a hyperparameter of the kernel function is introduced, which can attenuate the effect of unmodeled noise and outliers. In [32] and [38], Student's t distribution is used, which is known to be more robust to outliers than the Gaussian distribution.

In this paper, we propose a MOO method for expensive noisy objectives, particularly those with input-dependent noise. This method uses two GP regression methods to construct surrogate functions and plans the best experiment based on EIHV. These GP regression methods enable us to make good surrogates from data with input-dependent noise; however, the calculation of EIHV and the model selection between the two are problematic, because in the heteroscedastic case, the predictive density is not Gaussian and is, therefore, analytically intractable. In this paper, an approximation of EIHV with reasonable calculation cost and a novel method to determine which regression method to use at every step are also discussed. The effectiveness of the method is shown by numerical tests and robotic experiments. The contents of this paper partially appeared in [39]. Compared with our previous paper, this paper includes further numerical verification of the EIHV approximation, the lack of which was the primary weak point in [39]. Moreover, additional numerical verifications are included, which make the efficacy and limitations of the proposed method clear. We also conduct new sets of robotic experiments with a different robot to show the efficacy of the method in actual robotic problems.
The remainder of this paper is organized as follows. In Section II, the algorithm is explained in detail. In Sections III and IV, numerical and experimental validations are provided. Section V concludes the paper.

… it is important to note that because the goodness of the model is not taken into consideration in the calculation of EIHV, this can lead to premature termination in cases where the model is poor but confident in its prediction, especially at the …
The input space is assumed to be a subset of $\mathbb{R}^D$, the output space is $\mathbb{R}$, and $y$ and $f$ are defined as the vectors whose $i$th components are $y_i$ and $f_i = f(x_i)$, respectively. In standard GP regression, we assume

$$y = f(x) + \varepsilon, \qquad f(x) \sim \mathcal{GP}(m(x), k_f(x, x')), \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2) \qquad (2)$$

where $x \sim P$ means that a random variable $x$ is taken from a distribution (or a stochastic process) $P$; $\varepsilon$ is a noise term assumed to be drawn independently from the same Gaussian distribution regardless of the input $x$; and $k_f$ is a user-defined kernel function that expresses our prior knowledge of the latent function. The mean function $m(x)$ is set to $m(x) \equiv 0$ to keep the following calculation concise; however, in real applications, it can also be used to encode prior knowledge.

One of the most frequently used kernel functions is the squared exponential kernel (SE kernel, Gaussian kernel)

$$k_f(x_i, x_j) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2}(x_i - x_j)^T M (x_i - x_j)\right) \qquad (3)$$

where $M = \mathrm{diag}(l_1^{-2}, \ldots, l_D^{-2})$. The parameters $\sigma_n$, $\sigma_f$, and $l_i$ ($i = 1, \ldots, D$) are called the hyperparameters and should be tuned based on the samples.

To tune the hyperparameters, we maximize the marginal likelihood, or evidence,

$$p(y) = \int p(y|f)\,p(f)\,df = \int \mathcal{N}(y|f, \sigma_n^2 I)\,\mathcal{N}(f|0, K)\,df \qquad (4)$$

… same as that of standard GP regression. In the heteroscedastic case, the marginal likelihood is

$$p(y) = \iint p(y|f, g)\,p(f|g)\,p(g)\,df\,dg = \iint \mathcal{N}\big(y|f, \mathrm{diag}(e^{g_1}, \ldots, e^{g_n})\big)\,\mathcal{N}(f|0, K_f)\,\mathcal{N}(g|\mu_0 \mathbf{1}, K_g)\,df\,dg \qquad (7)$$

which is not analytically tractable, making it difficult to tune the hyperparameters. To optimize the hyperparameters in this case, VHGP regression maximizes a variational lower bound on the marginal likelihood, instead of the marginal likelihood itself, with respect to the variational parameters and the hyperparameters.

Define a function $F$ as

$$F(q(f), q(g)) = \log p(y) - \mathrm{KL}\big(q(f)\,q(g)\,\|\,p(f, g|y)\big) \qquad (8)$$

where $q(f)$ and $q(g)$ are the variational probability densities, and $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback–Leibler (KL) divergence. Because the KL divergence is nonnegative, $F$ gives a lower bound on the logarithm of the marginal likelihood $p(y)$. Therefore, we maximize $F$ instead of the marginal likelihood. To carry out the maximization of $F$, as a first step, the dependence on $q(f)$ can be eliminated by assuming that $q(g)$ is fixed and using the variational principle. This results in the optimal $q(f)$ as a function of $q(g)$, and by substituting it back into $F$, $F$ is transformed into what is called the marginalized variational bound …
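To make the homoscedastic baseline concrete, the following minimal sketch (ours, not from the paper; the function names are our own) evaluates the SE kernel (3) and the log of the marginal likelihood (4), which is the quantity that hyperparameter tuning maximizes:

```python
import numpy as np

def se_kernel(X1, X2, sigma_f, lengthscales):
    """Squared exponential kernel (3) with M = diag(l_1^-2, ..., l_D^-2)."""
    A = X1 / lengthscales          # scale each input dimension by its lengthscale
    B = X2 / lengthscales
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist)

def log_marginal_likelihood(X, y, sigma_n, sigma_f, lengthscales):
    """log p(y) in (4): y ~ N(0, K + sigma_n^2 I) under the zero-mean GP prior."""
    n = len(y)
    K = se_kernel(X, X, sigma_f, lengthscales) + sigma_n ** 2 * np.eye(n)
    L = np.linalg.cholesky(K)                              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 log|K|
            - 0.5 * n * np.log(2.0 * np.pi))
```

Maximizing this quantity with respect to $(\sigma_n, \sigma_f, l_1, \ldots, l_D)$, e.g., by a quasi-Newton method on the logarithms of the hyperparameters, is one standard way to implement the evidence maximization described above.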
Fig. 2. Surrogate function (thick line) interpolating the sampled points of an unknown underlying function with estimated uncertainty (three-sigma interval) shown in the gray area, and the EI for single-objective maximization (thick dashed line).

B. Expected Improvement in Hypervolume

Expected improvement (EI) is a popular statistical measure for making an efficient experimental plan for the next step, which automatically balances the tradeoff between exploration and exploitation without requiring a tuning parameter. To define EI, we first have to define the improvement.

In the single-objective case, the improvement at $x$ with the value $y$ is the increase in the maximum sampled target value. The expectation of the improvement is

$$EI(x) = \int_{\max(\tilde{Y})}^{\infty} \big(y - \max(\tilde{Y})\big)\,p(y|x)\,dy \qquad (12)$$

where $\tilde{Y}$ is the set of sampled target values. Fig. 2 illustrates the concept of EI. The dark colored area represents the probability that a sample at the point gives a better result than the current best one. Since EI considers not only the probability that this point is better but also by how much, the point with the highest probability of outperforming the current best point does not always have the highest EI value. In general, between two points with equal predictive mean, higher predictive variance implies higher EI, and between two points with equal predictive variance, higher predictive mean implies higher EI. In this figure, the rightmost peak of the EI corresponds to its maximum, and therefore the input that attains it will be used for the next experiment. In the case where the predictive distribution $p(y)$ is Gaussian, the analytic form of EI can be obtained [11].
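For reference, the closed form mentioned above can be written as the following sketch (our code, maximization convention, with a Gaussian predictive density of mean `mu` and standard deviation `s`):

```python
from scipy.stats import norm

def expected_improvement(mu, s, y_best):
    """Closed-form EI (12) when p(y|x) = N(mu, s^2); y_best = max of sampled values."""
    if s <= 0.0:                          # degenerate case: no predictive uncertainty
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / s
    return (mu - y_best) * norm.cdf(z) + s * norm.pdf(z)
```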
In MOO, because the solution is not a single point but a whole Pareto set of points, the improvement must capture the change in the quality of this set. One metric that expresses the quality of a set of solutions is the set's hypervolume [41]. This is the volume in objective space that is Pareto-dominated by at least one point in the Pareto subset of the set in question, while at the same time dominating a user-defined reference point, which basically defines the lower bounds of the objective values.

Let $HV(A)$ be the hypervolume of a set $A$; the improvement in the case where the output of the $m$ objective functions is $y \in \mathbb{R}^m$ is the resulting increase in hypervolume, $I(y) = HV(\tilde{Y} \cup \{y\}) - HV(\tilde{Y})$. The EIHV is the expectation of $I(y)$ under the predictive density, which in the heteroscedastic case can be written by conditioning on the latent log noise variances $g$ as

$$EI(x) = \int EI(x|g)\,p(g|x)\,dg \qquad (14)$$

where

$$EI(x|g) = \int I(y)\,p(y|g, x)\,dy = \int I(y)\,\mathcal{N}(y|a, \Sigma)\,dy, \qquad \Sigma = \mathrm{diag}\big(c_1^2 + e^{g_1}, \ldots, c_m^2 + e^{g_m}\big) \qquad (15)$$

where $c_k$ and $g_k$ correspond to the $k$th objective. For $EI(x|g)$, the closed form derived in [30] can be used, and because $p(g|x)$ is a Gaussian density function, $EI(x)$ can be calculated numerically by Gauss–Hermite quadrature if the number of objectives $m$ is small. However, even with the closed form [30], the calculation of EIHV is still time consuming, and the following approximation of EIHV gives equivalent or better results with much less computation, as will be shown by the numerical examples in Section III-E2:

$$\overline{EI}(x) = \int I(y)\,\mathcal{N}(y|a, \bar{\Sigma})\,dy \qquad (16)$$

where

$$\bar{\Sigma} = \mathrm{diag}\big(c_1^2 + e^{\mu_1 + \sigma_1^2/2}, \ldots, c_m^2 + e^{\mu_m + \sigma_m^2/2}\big). \qquad (17)$$

This is obtained by approximating the predictive density (10) by a Gaussian distribution with the same mean and variance as the true density calculated in (11), and can be calculated by the formula in [30]. In the limit $\sigma \to 0$ (i.e., in the limit of no uncertainty in the noise variance), (16) tends to be identical to (14). Therefore, (16) is expected to give a good approximation in the case where $\sigma$ is small compared with $|\mu|$. Note that the value of EIHV itself is not so important for our purpose, because we need only the maximizer of EIHV and not EIHV itself. Although this approximation may not be as accurate as Gauss–Hermite quadrature, the numerical examples in Section III-E2 show that the discrepancy becomes small in the neighborhood of the maximum of EIHV, which implies that this approximation is sufficient for experiment planning.
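In code, the approximation (16), (17) reduces to a moment-matched covariance; in the sketch below (ours), `exact_eihv_gaussian` stands for the closed-form EIHV of [30] for a Gaussian density and is assumed to be available:

```python
import numpy as np

def approx_eihv(a, c2, mu_g, var_g, exact_eihv_gaussian):
    """Approximate EIHV (16): moment-match the predictive density with a Gaussian.

    a     : predictive means of the m objectives at the candidate x
    c2    : predictive variances c_k^2 of the latent objective functions
    mu_g  : posterior means mu_k of the log noise variances g_k
    var_g : posterior variances sigma_k^2 of the g_k
    """
    # E[e^{g_k}] = exp(mu_k + sigma_k^2 / 2) for Gaussian g_k (log-normal mean),
    # which gives the matched variances in (17).
    sigma_bar = np.diag(c2 + np.exp(mu_g + var_g / 2.0))
    return exact_eihv_gaussian(a, sigma_bar)
```

As `var_g` tends to zero, the matched variance collapses to $c_k^2 + e^{\mu_k}$, which is consistent with (16) approaching (14) in that limit.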
C. Model Selection

Because in most cases we have little prior knowledge of the objective functions, we have to choose the best prior distribution, or model, from multiple candidates. In particular, the selection between the standard GP model and the HGP model is important. As the HGP model is more complex, it is more likely to overfit than the standard GP model. The problem is not only that the HGP model is more prone to overfitting, but also that this …

Fig. 3. Two types of overfitting. The crosses are samples, the solid lines are mean functions, and the colored areas show 2-σ intervals. (a) Common to both regressions. (b) Special to HGP regression.
TABLE I
TRUE HYPERVOLUME

        MAT       T3        T4        T6
        5.1013    10.0444   44.6667   6.7989
Fig. 6. Graph of the test function T3. (a) Objective 1. (b) Objective 2.

Fig. 7. Graph of the test function T4. (a) Objective 1. (b) Objective 2.

Fig. 8. Graph of the test function T6. (a) Objective 1. (b) Objective 2.

Fig. 9. Comparison between the existing method (std. GP regression only) and the method using VHGP regression only. The test function is MAT (hypervolume: 5.1013). (a) 5 initial evaluations. (b) 15 initial evaluations.

… The domain is [0, 1] × [−5, 5]. The graphs of the functions are shown in Fig. 7. The reference point for hypervolume calculation is taken to be [−1, −45].

4) Test Function T6: The test function T6 used in this research is defined as

$$f_1(x_1, x_2) = -1 + \exp(-4x_1)\sin^6(6\pi x_1)$$
$$f_2(x_1, x_2) = -g\big(1 - (f_1/g)^2\big)$$
$$g(x_2) = 1 + 9x_2^{1/4}. \qquad (24)$$

The domain is [0, 1] × [0, 1]. The graphs of the functions are shown in Fig. 8. The reference point for hypervolume calculation is taken to be [−1, −10].

B. Additive Noise

We tested with two kinds of Gaussian noise; the first was homoscedastic with variance $r(x) = \bar{\sigma}_n^2$ (= const.), and the second was heteroscedastic with variance $r(x) = \{\sigma_n(\sin(\|x\|) + 1)/2\}^2$.
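As an executable reference, here is a sketch (ours) of the T6 objectives (24) and of the two noise variance models, where $\|x\|$ in the sinus noise is read as the Euclidean norm of the input:

```python
import numpy as np

def t6(x1, x2):
    """Test function T6 (24), stated for maximization."""
    f1 = -1.0 + np.exp(-4.0 * x1) * np.sin(6.0 * np.pi * x1) ** 6
    g = 1.0 + 9.0 * x2 ** 0.25
    f2 = -g * (1.0 - (f1 / g) ** 2)
    return f1, f2

def homoscedastic_var(x, sigma_bar_n):
    """r(x) = sigma_bar_n^2, independent of the input."""
    return sigma_bar_n ** 2

def sinus_var(x, sigma_n):
    """r(x) = {sigma_n (sin(||x||) + 1) / 2}^2, the heteroscedastic 'sinus' noise."""
    return (sigma_n * (np.sin(np.linalg.norm(x)) + 1.0) / 2.0) ** 2
```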
C. Performance Metric

For the selection metric of the optimization method, the hypervolume indicator [41] was used. This is a common unary indicator that has been examined closely [45]. We calculated the true noiseless function values at the points of the resultant approximated Pareto set. The hypervolume was calculated based on these true function values, instead of the sampled values, because a hypervolume calculated from noisy samples can be under- or overestimated. Note that the number of points in the approximated Pareto set varies from trial to trial and is not constant. We ran 60 trials for each setting and calculated the empirical median, 25th percentile, and 75th percentile of the hypervolume. In the tests, the approximated Pareto front is generated from the evaluated points. However, note that it has been suggested that constructing the approximated Pareto front from surrogate functions would give a better approximation than constructing it only from evaluations [46]. Therefore, the performance shown in what follows can be understood as a lower limit.

If there is no noise on the observations, the hypervolume will increase monotonically as the number of observations gets larger. However, in our case, because the algorithm plans the experiments and returns the approximated Pareto set based on the noisy samples, the corresponding estimated hypervolume can decrease.

D. Common Settings

To solve the maximization problem of EIHV, we chose to first calculate EIHV at densely sampled points, and then used the maximizer among these points as the starting point of a gradient method. Of course, other kinds of maximization methods, such as a gradient method with random restarts or some EAs, are applicable.

Regarding the initial settings of the hyperparameters, we set all of the initial values of the hyperparameters to 1 for standard homoscedastic GP regression. For heteroscedastic regression, we used the result of homoscedastic regression to set the initial hyperparameter values.

All numerical tests were repeated 60 times to make the results statistically reliable. To illustrate the results, hypervolumes were plotted against the number of evaluations (Figs. 9 and 12–22). Because the distribution of hypervolumes after a fixed number of evaluations is skewed, we used the median and the 25th/75th percentiles.
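The two-stage EIHV maximization of Section III-D can be sketched as follows (our naming; `eihv` is any callable returning the approximate EIHV at an input, and random candidates stand in for the dense sampling):

```python
import numpy as np
from scipy.optimize import minimize

def maximize_eihv(eihv, bounds, n_candidates=2500, rng=None):
    """Stage 1: evaluate EIHV on densely sampled points.
    Stage 2: refine the best candidate with a gradient-based method."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = np.array(bounds).T
    cand = lo + (hi - lo) * rng.random((n_candidates, len(bounds)))
    start = cand[np.argmax([eihv(x) for x in cand])]
    # L-BFGS-B with numerical gradients, started from the best dense sample.
    res = minimize(lambda x: -eihv(x), start, method="L-BFGS-B", bounds=bounds)
    return res.x
```

The dense pre-scan guards against the gradient method getting trapped in a poor local maximum of the typically multimodal EIHV surface.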
Authorized licensed use limited to: Izmir Katip Celebi Univ. Downloaded on October 25,2023 at 08:17:45 UTC from IEEE Xplore. Restrictions apply.
476 IEEE TRANSACTIONS ON ROBOTICS, VOL. 33, NO. 2, APRIL 2017
E. Results

1) Need for Model Selection Between Two Kinds of GPs: First, to show the need for the model selection discussed in Section II-C, we compare the performances of two cases: 1) standard GP regression only (the existing method [3]) and 2) VHGP regression only. The tests were performed with two different settings for the number of initial points: 5 and 15. The test function MAT was used, and homoscedastic noise (σ̄_n = 0.15) was added to the observations. A maximum of 40 evaluations was performed for each trial. The results are shown in Fig. 9.

In the case with five initial evaluations, the existing method (standard GP regression only) clearly outperforms the other (VHGP regression only). This is because the VHGP model is more complex than the standard GP model and tends to overfit small data sets. In the case with 15 initial points, both methods work equally well.

The problem is that, in real problems, the necessary number of initial evaluations for VHGP regression is likely unknown, and this in turn shows the need for model selection between standard GP and VHGP. We show in the following that, through the model selection methods proposed in Section II-C, the results will be at least as good as, and in many cases better than, the best of methods 1) and 2).
2) Comparison Between Two EI Calculations, (14) and (16): In this test, we compared the performances of the two calculations of EIHV in the case of VHGP regression: Gauss–Hermite quadrature (14) and the Gaussian approximation of the predictive density (16). Here, only VHGP regression was used, not standard GP regression.

The tests were done for MAT with two kinds of additive noise: homoscedastic noise with σ̄_n = 0.15, and sinus noise with σ_n = 0.2, as explained in Section III-B. The number of initial evaluations was set to 15. For Gauss–Hermite quadrature, nine (3 × 3) nodes were used, which gave enough precision for the EIHV calculation.
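The quadrature in (14) can be sketched as follows (our code); `ei_given_g` stands for the closed form of [30] evaluated at fixed log noise variances g, as in (15):

```python
import numpy as np

def eihv_gauss_hermite(ei_given_g, mu_g, sigma_g, n_nodes=3):
    """EIHV (14) by a tensor-product Gauss-Hermite rule over the m independent
    Gaussian log noise variances (3 x 3 = 9 nodes for m = 2 objectives)."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)  # nodes/weights for exp(-t^2)
    mu_g, sigma_g = np.asarray(mu_g), np.asarray(sigma_g)
    m = len(mu_g)
    total = 0.0
    for idx in np.ndindex(*([n_nodes] * m)):
        ii = list(idx)
        g = mu_g + np.sqrt(2.0) * sigma_g * t[ii]    # change of variables
        total += np.prod(w[ii]) / np.pi ** (m / 2.0) * ei_given_g(g)
    return total
```

Each additional objective multiplies the node count by `n_nodes`, which is one reason this route becomes expensive compared with the single closed-form evaluation used by (16).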
Fig. 10. Slices of the negative log EIHV surface calculated by Gauss–Hermite quadrature (14) and Gaussian approximation (16) [see (a) and (b)], and their differences [see (c) and (d)]. The slices are axis-aligned [(a) and (c) with x2 = 2.0202 and (b) and (d) with x1 = 3.9394], and go through the point with maximum discrepancy between the methods. Homoscedastic noise (σ̄_n = 0.15) is used. (a) −log(EIHV) after 40 evaluations. (b) −log(EIHV) after 40 evaluations. (c) Discrepancy after 40 evaluations. (d) Discrepancy after 40 evaluations.

Fig. 11. Slices of the negative log EIHV surface calculated by Gauss–Hermite quadrature (14) and Gaussian approximation (16) [see (a) and (b)], and their differences [see (c) and (d)]. The slices are axis-aligned with x1 = 7.7778 [see (a) and (c)] and x2 = 1.5152 [see (b) and (d)]. Sinus noise (σ_n = 0.2) is used. (a) −log(EIHV) after 40 evaluations. (b) −log(EIHV) after 40 evaluations. (c) Discrepancy after 40 evaluations. (d) Discrepancy after 40 evaluations.

Figs. 10 and 11 show axis-aligned slices of the negative log EIHV surface after 40 evaluations, as calculated by each method, and the corresponding slices of the discrepancy. The planes shown are selected to go through the point with maximum error. It can be seen that the discrepancy becomes large at the maxima of the negative log EIHV, whereas at points with small negative log EIHV, the discrepancy becomes small. Because we need the minimizer of the negative log EIHV (the maximizer of the EIHV), the approximation (16) is considered appropriate for our purpose, though it can be inaccurate in the areas that we are not interested in. Additionally, we observed that the σ∗ in (10) are very small compared with |μ∗|; in fact, σ∗/|μ∗| is around 10⁻⁵–10⁻¹⁰ at the minimizer of the negative log EIHV, which verifies the use of the approximation (16).

Fig. 12. Comparison between two metric calculations: (14) and its approximation (16). Because we use 15 initial evaluations, the graphs begin with 16 evaluations. The test function is MAT (hypervolume: 5.1013). (a) Homoscedastic noise (σ̄_n = 0.15). (b) Sinusoidal noise (σ_n = 0.2).

Fig. 13. Time needed to complete one step of the procedure. (a) Simplified EI versus Complete EI. (b) LOO versus MC.

Fig. 14. Comparison between LOO and MC as the model selection method. The number of initial evaluations is 15. The test function is MAT (hypervolume: 5.1013). (a) Homoscedastic noise (σ̄_n = 0.15). (b) Sinusoidal noise (σ_n = 0.2).

Fig. 15. Comparison between the proposed method (model selection) and the existing method (std. GP regression only) [3]. The number of initial evaluations is 15. The test function is MAT (hypervolume: 5.1013). (a) Homoscedastic noise (σ̄_n = 0.15). (b) Sinusoidal noise (σ_n = 0.2).

Hypervolumes are plotted against the number of evaluations in Fig. 12, and Fig. 13(a) shows the time consumption for each step in the case with sinus noise. The total time required for one trial is the integral of the curve, which was about 6 h where Gauss–Hermite quadrature was used, and about 44 min in the other case. From these results, it can be seen that the calculation of EIHV through Gauss–Hermite quadrature is very time consuming given the insignificant improvements in accuracy.
Because our goal is to find the solution set with as few evalua-
tions as possible and not to reduce the calculation cost, a method
that requires a great amount of calculation can be used as long as
it contributes to reducing the number of necessary observations.
However, because the time-consuming calculation (14) did not
reduce the required number of samples, we concluded that the
approximation (16) should be applied instead.
3) Comparison Between Two Model Selection Methods: For selecting between standard GP and VHGP, two methods are compared: LOO, as explained in Section II-C, and numerical calculation of the marginal likelihood by MC. See the Appendix for details on the MC calculation. MAT was used as the test function. The results are shown in Fig. 14, and the time required for each step of the procedure is shown in Fig. 13(b).

Fig. 16. Comparison between the proposed method (model selection) and the method that uses VHGP regression only. The test function is MAT (hypervolume: 5.1013). (a) Homoscedastic noise (σ̄_n = 0.15). (b) Sinusoidal noise (σ_n = 0.2).

Fig. 17. Comparison between the proposed method and the existing method [3]. The number of initial evaluations is 20. Test function is T3 (hypervolume: 10.0444). (a) Homoscedastic noise (σ̄_n = 0.1). (b) Sinusoidal noise (σ_n = 0.2).

From Figs. 13(b) and 14, it can be seen that the performance is slightly better if LOO is used, and the required time is also less with LOO if there are fewer than about 40 points. The poorer performance of MC is attributed to the limited precision of the calculation of the marginal likelihood. Because the precision of MC is proportional to the square root of the sample size, it is difficult to precisely calculate the marginal likelihood, which is typically much smaller than 1. Moreover, we usually use the logarithm of the marginal likelihood, which amplifies the error as

$$\log(p + \Delta p) - \log(p) \approx \Delta p / p > \Delta p \qquad (25)$$

where $p < 1$ is the marginal likelihood. Therefore, we need several hundred times more samples to get ten times more precise estimates.
Fig. 18. Comparison between the proposed method and the method that uses VHGP regression only. Test function is T3 (hypervolume: 10.0444). (a) Homoscedastic noise (σ̄_n = 0.1). (b) Sinusoidal noise (σ_n = 0.2).

Fig. 19. Comparison between the proposed method and the existing method [3]. The number of initial evaluations is 20. Test function is T4 (hypervolume: 44.6667). (a) Homoscedastic noise (σ̄_n = 0.2). (b) Sinusoidal noise (σ_n = 0.25).

Fig. 21. Comparison between the proposed method and the existing method [3]. The number of initial evaluations is 40. Test function is T6 (hypervolume: 6.7989). (a) Homoscedastic noise (σ̄_n = 0.1). (b) Sinusoidal noise (σ_n = 0.1).

Fig. 22. Comparison between the proposed method and the method that uses VHGP regression only. Test function is T6 (hypervolume: 6.7989). (a) Homoscedastic noise (σ̄_n = 0.1). (b) Sinusoidal noise (σ_n = 0.1).
Fig. 23. Comparison between the proposed method and the existing method. Test function is T3 with 10-D input (hypervolume: 10.0444). (a) Homoscedastic noise (σ̄_n = 0.1). (b) Sinusoidal noise (σ_n = 0.2).
IV. EXPERIMENTS

We conducted robotic experiments with a snake robot (Fig. 24), which moves via sidewinding locomotion. The objective functions were set to be the speed and the stability of the robot head. Here, head stability is roughly inversely proportional to the amount of head motion, which is very important when operating the robot. Again, in the experiments, we did not implement Lines 11–13 of Algorithm 1.

Fig. 25. Resulting Pareto front in the case of (a) the proposed method and (b) the existing method [3] after 25, 30, 35, and 40 evaluations. Observed data are shown as × marks, and the circles are the elements of the Pareto front.

The snake robot is often controlled by motions within a finite-dimensional constrained control trajectory subspace (the same as the gait model described in [47]), which is defined as

$$\alpha(n, t) = \begin{cases} \beta_{\mathrm{even}} + A_{\mathrm{even}} \sin(\theta), & n \ \text{even} \\ \beta_{\mathrm{odd}} + A_{\mathrm{odd}} \sin(\theta + \delta), & n \ \text{odd} \end{cases}, \qquad \theta = \omega_s n + \omega_t t \qquad (26)$$

where $\alpha(n, t)$ is the $n$th joint angle at time $t$. This can be seen as an extension of the serpenoid curve [48], which models the shape of a real snake well. There are seven free parameters (the $\beta$s, the $A$s, $\delta$, $\omega_s$, and $\omega_t$), and by tuning these parameters, we can command the snake to move via many kinds of motions, including slithering, sidewinding, and rolling in an arc.
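In code, the gait model (26) is simply the following mapping (our sketch; concrete parameter values for sidewinding are not specified here):

```python
import numpy as np

def joint_angle(n, t, beta_even, beta_odd, A_even, A_odd, delta, omega_s, omega_t):
    """Gait model (26): commanded angle of the nth joint at time t."""
    theta = omega_s * n + omega_t * t
    if n % 2 == 0:
        return beta_even + A_even * np.sin(theta)
    return beta_odd + A_odd * np.sin(theta + delta)
```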
Snake robots are expected to be useful for many applications, including pipe inspections and urban search and rescue operations. In practice, the operator has to rely on the image from the camera mounted on the head, which makes it challenging to operate the robot from a distance. Although a method that provides information about the state of the robot based on the virtual chassis [49] has been proposed, it remains difficult to operate the robot if the camera is mounted on a shaky base. At the same time, it would be advantageous to move the robot faster, which will amplify the movement of the head camera. Therefore, we chose the two objective functions as follows.

1) The speed: The net displacement of the head after running the snake for 15 s.

2) The head stability: The stability of the image from the camera mounted on the head of the snake.

For head stability, we put an acceleration sensor on the head and calculated the level of vibration as follows. Let $a(t)$ be the reading of the acceleration sensor; then the head stability is defined as

$$(\text{Head Stability}) = -\int_0^T \big\| a(t) - E[a(t)] \big\|^2 \, dt \qquad (27)$$

where $E[a(t)]$ is the mean of $a(t)$ during one run.

Although it is not clear from the above definitions alone whether there is a tradeoff between the two objectives, the proposed method, along with other MOO methods, can be used for cases where the objectives do not actually conflict. Therefore, if there is no reason to deny the existence of a conflict between the objectives, it is better to turn to MOO.
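A discretized version of the head stability (27), assuming a three-axis accelerometer sampled at a fixed interval `dt` (the array shapes are our assumption):

```python
import numpy as np

def head_stability(accel, dt):
    """Head stability (27) from sampled accelerometer readings.

    accel : (T, 3) array of acceleration samples a(t) over one 15-s run
    dt    : sampling interval, so the sum approximates the time integral
    """
    dev = accel - accel.mean(axis=0)                 # a(t) - E[a(t)]
    return -np.sum(np.sum(dev ** 2, axis=1)) * dt    # minus the integrated squared deviation
```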
Fig. 26. Example of the response surfaces of head stability and speed after 39 experiments, which were used to plan the 40th experiment. The color shows the variance of the predictive distribution. (c) and (d) show the variance as color maps, from which the nonuniformity of the uncertainty can be ascertained. (a) Head stability. (b) Head speed. (c) Uncertainty in head stability estimation. (d) Uncertainty in head speed estimation.

TABLE II
OBSERVED VALUE OF THE HEAD STABILITY

        Samp.1    Samp.2    Samp.3    Samp.4    Samp.5    st. dev.
x1      19.0062   15.2082   22.6125   16.6942   17.4333   2.8252
x2      10.4526   9.4061    9.8783    9.6554    11.5510   0.8542
x3      7.0852    6.9981    6.4631    6.8942    7.1128    0.2644
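The standard deviations in Table II follow directly from the five samples per input; for example, for x1:

```python
import numpy as np

x1 = np.array([19.0062, 15.2082, 22.6125, 16.6942, 17.4333])
print(x1.std(ddof=1))   # ~2.8252, as in the table; the much smaller values for
                        # x2 and x3 are direct evidence of input-dependent noise
```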
[23] H. Eskandari and C. D. Geiger, "A fast Pareto genetic algorithm approach for solving expensive multiobjective optimization problems," J. Heuristics, vol. 14, no. 3, pp. 203–241, Jun. 2008.
[24] J. E. Fieldsend and R. M. Everson, "The rolling tide evolutionary algorithm: A multiobjective optimizer for noisy optimization problems," IEEE Trans. Evol. Comput., vol. 19, no. 1, pp. 103–117, Feb. 2015.
[25] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[26] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans, "Automatic gait optimization with Gaussian process regression," in Proc. Int. Joint Conf. Artif. Intell., Hyderabad, India, 2007, pp. 944–949.
[27] M. Tesch, J. Schneider, and H. Choset, "Using response surfaces and expected improvement to optimize snake robot gait parameters," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., San Francisco, CA, USA, 2011, pp. 1069–1074.
[28] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, "Bayesian optimization for learning gaits under uncertainty," Ann. Math. Artif. Intell., vol. 76, pp. 5–23, 2016.
[29] M. Zuluaga, A. Krause, G. Sergent, and M. Püschel, "Active learning for multi-objective optimization," in Proc. Int. Conf. Mach. Learn., Atlanta, GA, USA, 2013, pp. 462–470.
[30] M. Emmerich and J.-W. Klinkenberg, "The computation of the expected improvement in dominated hypervolume of Pareto front approximations," Leiden Inst. Adv. Comput. Sci., Leiden, The Netherlands, Tech. Rep. 1, 2008.
[31] J.-A. M. Assael, Z. Wang, B. Shahriari, and N. de Freitas, "Heteroscedastic treed Bayesian optimization," arXiv:1410.7172, 2014.
[32] M. A. Taddy, H. K. H. Lee, G. A. Gray, and J. D. Griffin, "Bayesian guided pattern search for robust local optimization," Technometrics, vol. 51, no. 4, pp. 389–401, 2009.
[33] P. Goldberg, C. Williams, and C. Bishop, "Regression with input-dependent noise: A Gaussian process treatment," in Proc. Adv. Neural Inf. Process. Syst., Denver, CO, USA, 1997, pp. 493–499.
[34] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, "Most likely heteroscedastic Gaussian process regression," in Proc. Int. Conf. Mach. Learn., Corvallis, OR, USA, 2007, pp. 393–400.
[35] M. Lázaro-Gredilla and M. K. Titsias, "Variational heteroscedastic Gaussian process regression," in Proc. Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011, pp. 841–848.
[36] S. Kuindersma, R. Grupen, and A. Barto, "Variable risk control via stochastic optimization," Int. J. Robot. Res., vol. 32, no. 7, pp. 806–825, Jun. 2013.
[37] R. Martinez-Cantin, N. de Freitas, E. Brochu, J. Castellanos, and A. Doucet, "A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot," Auton. Robots, vol. 27, no. 2, pp. 93–103, 2009.
[38] R. Martinez-Cantin, "BayesOpt: A Bayesian optimization library for nonlinear optimization, experimental design and bandits," J. Mach. Learn. Res., vol. 15, no. 1, pp. 3735–3739, 2014.
[39] R. Ariizumi, M. Tesch, H. Choset, and F. Matsuno, "Expensive multiobjective optimization for robotics with consideration of heteroscedastic noise," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Chicago, IL, USA, 2014, pp. 2230–2235.
[40] M. D. McKay, W. J. Conover, and R. J. Beckman, "A comparison of three methods for selecting values of input variables in the analysis of output from a computer code," Technometrics, vol. 21, no. 2, pp. 239–245, May 1979.
[41] E. Zitzler and L. Thiele, "Multiobjective optimization using evolutionary algorithms—A comparative case study," in Parallel Problem Solving From Nature—PPSN V. New York, NY, USA: Springer, Sep. 1998, pp. 292–304.
[42] Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Freitas, "Bayesian optimization in a billion dimensions via random embeddings," J. Artif. Intell. Res., vol. 55, pp. 361–387, 2016.
[43] E. Zitzler, K. Deb, and L. Thiele, "Comparison of multiobjective evolutionary algorithms: Empirical results," Evol. Comput., vol. 8, no. 2, pp. 173–195, 2000.
[44] L. Dixon and G. Szegő, "The global optimization problem: An introduction," in Towards Global Optimization, vol. 2. Amsterdam, The Netherlands: North-Holland, 1978, pp. 1–15.
[45] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. Grunert da Fonseca, "Performance assessment of multiobjective optimizers: An analysis and review," IEEE Trans. Evol. Comput., vol. 7, no. 2, pp. 117–132, Apr. 2003.
[46] R. Calandra, J. Peters, and M. P. Deisenroth, "Pareto front modeling for sensitivity analysis in multi-objective Bayesian optimization," in Proc. NIPS Workshop Bayesian Optim., 2014, vol. 5.
[47] M. Tesch et al., "Parameterized and scripted gaits for modular snake robots," Adv. Robot., vol. 23, no. 9, pp. 1131–1158, 2009.
[48] S. Hirose, Biologically Inspired Robots: Snake-Like Locomotors and Manipulators. Oxford, U.K.: Oxford Univ. Press, 1993.
[49] D. Rollinson and H. Choset, "Virtual chassis for snake robots," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., San Francisco, CA, USA, 2011, pp. 221–226.
[50] S. Chib and I. Jeliazkov, "Marginal likelihood from the Metropolis–Hastings output," J. Amer. Statist. Assoc., vol. 96, no. 453, pp. 270–281, Mar. 2001.
[51] S. Chib and I. Jeliazkov, "Accept–reject Metropolis–Hastings sampling and marginal likelihood estimation," Statistica Neerlandica, vol. 59, no. 1, pp. 30–44, Feb. 2005.

Ryo Ariizumi (M'15) received the B.S., M.E., and Ph.D. degrees in engineering from Kyoto University, Kyoto, Japan, in 2010, 2012, and 2015, respectively. He was a Postdoctoral Researcher with Kyoto University. He is currently an Assistant Professor with Nagoya University, Nagoya, Japan. His research interests include the control of redundant robots and the optimization of robotic systems.

Matthew Tesch (M'13) received the B.S. degree in engineering from Franklin W. Olin College of Engineering, Needham, MA, USA, in 2007, and the M.S. and Ph.D. degrees in robotics from Carnegie Mellon University, Pittsburgh, PA, USA, in 2011 and 2013, respectively. He is a Postdoctoral Researcher with Carnegie Mellon University. His research focuses on machine learning methods for efficiently optimizing robotic systems for which data collection is time consuming or expensive. Application of this work has resulted in significant improvements in the locomotive capabilities of snake robots.

Kenta Kato received the B.S. and M.E. degrees in mechanical engineering from Kyoto University, Kyoto, Japan, in 2014 and 2016, respectively. He is with Mitsubishi Electric Corporation. His research interests include the semiautonomous control of rescue robots and machine learning, especially expensive multiobjective optimization.
Howie Choset (M’92) received the Ph.D. degree in Fumitoshi Matsuno (M’94) received the Ph.D. (Dr.
mechanical engineering from California Institute of Eng.) degree in engineering from Osaka University,
Technology, Pasadena, CA, USA, in 1996. He is a Osaka, Japan, in 1986.
Professor of robotics with Carnegie Mellon Univer- In 1986, he joined the Department of Con-
sity, Pittsburgh, PA, USA. Motivated by applications trol Engineering, Osaka University. He became a
in confined spaces, he has created a comprehensive Lecturer and an Associate Professor in 1991 and
program in snake robots, which has led to basic re- 1992, respectively, in the Department of Systems En-
search in mechanism design, path planning, motion gineering, Kobe University. In 1996, he joined the
planning, and estimation. By pursuing the fundamen- Department of Computational Intelligence and Sys-
tals, this research program has made contributions to tems Science, Interdisciplinary Graduate School of
coverage tasks, dynamic climbing, and large-space Science and Engineering, Tokyo Institute of Tech-
mapping. He has already directly applied this body of work in challenging and nology, as an Associate Professor. In 2003, he became a Professor with the
strategically significant problems in diverse areas, such as surgery, manufactur- Department of Mechanical Engineering and Intelligent Systems, University of
ing, infrastructure inspection, and search and rescue. He directs the Undergrad- Electro-Communications. Since 2009, he has been a Professor with the Depart-
uate Robotics Minor Program at Carnegie Mellon University and teaches an ment of Mechanical Engineering and Science, Kyoto University, Kyoto, Japan.
overview course on robotics, which uses series of custom developed Lego Labs He is also a Vice President of NPO International Rescue Systems Institute and
to complement the course work. the Institute of Systems, Control and Information Engineers. His research in-
Prof. Choset’s students have received Best Paper Awards at the RIA in 1999 terests include robotics, control of distributed parameter systems and nonlinear
and ICRA in 2003; his group’s work has been nominated for best papers at ICRA systems, rescue support systems in fires and disasters, and swarm intelligence.
in 1997, IROS in 2003 and 2007, and CLAWAR in 2012 (Best Biorobotics Paper Dr. Matsuno has received many awards, including the Outstanding Paper
and Best Student Paper); they also received the best paper at IEEE Bio Rob in Award in 2001 and 2006, the Takeda Memorial Prize in 2001 from the Society of
2006, best video at ICRA 2011, and were nominated for the best video in ICRA Instrument and Control Engineers (SICE), and the Outstanding Paper Award in
2012. In 2002, the MIT Technology Review elected him as one of its top 100 2013 from the Information Processing Society of Japan. He is a Fellow Member
innovators under 35 in the world. In 2005, he authored a book titled Principles of the SICE, the Japan Society of Mechanical Engineers, the Robotics Society
of Robot Motion (MIT Press). of Japan, and a member of the ISCIE, among other organizations. He served as
the Cochair of the IEEE Robotics and Automation Society Technical Committee
on Safety, Security, and Rescue Robotics; the Chair of the Steering Committee
of the SICE Annual Conference; and is an Editor for Journal of Intelligent and
Robotic Systems, is an Associate Editor for Advanced Robotics and International
Journal of Control, Automation, and Systems among others, and is an Edito-
rial Board Member with the Conference Editorial Board of the IEEE Control
Systems Society.