Optuna: A Next-generation Hyperparameter Optimization Framework
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama
Preferred Networks, Inc.
Abstract
The purpose of this study is to introduce new design criteria for next-generation hyperparameter optimization software. The criteria we propose include (1) a define-by-run API that allows users to construct the parameter search space dynamically, (2) efficient implementation of both searching and pruning strategies, and (3) an easy-to-setup, versatile architecture that can be deployed for various purposes, ranging from scalable distributed computing to light-weight experiments conducted via interactive interfaces. In order to prove our point, we will introduce Optuna, an optimization software which is a culmination of our effort in the development of a next-generation optimization software. As an optimization software designed with the define-by-run principle, Optuna is in particular the first of its kind. We will present the design techniques that became necessary in the development of the software that meets the above criteria, and demonstrate the power of our new design through experimental results and real-world applications. Our software is available under the MIT license (https://github.com/pfnet/optuna/).
Hyperopt [1], Spearmint [2], SMAC [3], Autotune [4], and Vizier [5] were all developed in order to meet this need.

The choice of the parameter-sampling algorithm varies across frameworks. Spearmint [2] and GPyOpt use Gaussian Processes, and Hyperopt [1] employs the tree-structured Parzen estimator (TPE) [6]. Hutter et al. proposed SMAC [3], which uses random forests. Recent frameworks such as Google Vizier [5], Katib, and Tune [7] also support pruning algorithms, which monitor the intermediate result of each trial and kill unpromising trials prematurely in order to speed up the exploration. Pruning algorithms for hyperparameter optimization are an active field of research. Domhan et al. proposed a method that uses parametric models to predict the learning curve [8]. Klein et al. constructed Bayesian neural networks to predict the expected learning curve [9]. Li et al. employed a bandit-based algorithm and proposed Hyperband [10].

Still another way to accelerate the optimization process is to use distributed computing, which enables parallel processing of multiple trials. Katib is built on Kubeflow, a computing platform for machine learning services based on Kubernetes. Tune also supports parallel optimization, and uses the Ray distributed computing platform [11].

However, there are several serious problems that are being overlooked in many of these existing optimization frameworks. Firstly, all previous hyperparameter optimization frameworks to date require the user to statically construct the parameter search space for each model, and the search space can be extremely hard to describe in these frameworks for large-scale experiments that involve a massive number of candidate models of different types with large parameter spaces and many conditional variables. When the parameter space is not appropriately described by the user, application of advanced optimization methods can

… setup requirements. If possible, the architecture shall be installable with a single command as well, and it shall be designed as open-source software so that it can continuously incorporate the newest species of optimization methods by interacting with the open source community.

In order to address these concerns, we propose to introduce the following new design criteria for next-generation optimization frameworks:

• Define-by-run programming that allows the user to dynamically construct the search space,

• Efficient sampling algorithm and pruning algorithm that allow some user-customization,

• Easy-to-setup, versatile architecture that can be deployed for tasks of various types, ranging from light-weight experiments conducted via interactive interfaces to heavy-weight distributed computations.

In this study, we will demonstrate the significance of these criteria through Optuna, an open-source optimization software which is a culmination of our effort in making our definition of the next-generation optimization framework come to reality.

We will also present new design techniques and new optimization algorithms that we had to develop in order to meet our proposed criteria. Thanks to these new design techniques, our implementation outperforms many major black-box optimization frameworks while being easy to use and easy to set up in various environments. In what follows, we will elaborate each of our proposed criteria together with our technical solutions, and present experimental results in both real-world applications and benchmark datasets.
Optuna is released under the MIT license (https://github.com/pfnet/optuna/), and has been in production use at Preferred Networks for more than one year.

2.1 Modular Programming

A keen reader might have noticed in Figure 3 that the optimization code written in Optuna is highly modular, thanks to its define-by-run design. Compatibility with modular programming
Table 1: Software frameworks for deep learning and hyperparameter optimization, sorted by their API styles: define-and-run and define-by-run.

Define-and-Run Style (symbolic, static)
  Deep Learning Frameworks: Torch (2002), Theano (2007), Caffe (2013), TensorFlow (2015), MXNet (2015), Keras (2015)
  Hyperparameter Optimization Frameworks: SMAC (2011), Spearmint (2012), Hyperopt (2015), GPyOpt (2016), Vizier (2017), Katib (2018), Tune (2018), Autotune (2018)

Define-by-Run Style (imperative, dynamic)
  Deep Learning Frameworks: Chainer (2015), DyNet (2016), PyTorch (2016), TensorFlow Eager (2017), Gluon (2017)
  Hyperparameter Optimization Frameworks: Optuna (2019; this work)
Table 2: Comparison of previous hyperparameter optimization frameworks and Optuna. A framework receives a checkmark for "lightweight" if it is easy to set up and can easily be used for lightweight purposes.
…parameters found by the algorithm. The above example (Figure 4) might make it seem as if the user has to write a different version of the objective function that does not invoke ‘trial.suggest’ in order to deploy the best configuration. Luckily, this is not a concern. For deployment purposes, Optuna features a separate class called ‘FixedTrial’ that can be passed to objective functions. The ‘FixedTrial’ object has practically the same set of functionalities as the trial class, except that it will only suggest the user-defined set of the hyperparameters when passed to the objective functions. Once a parameter-set of interest is found (e.g., the best ones), the user simply has to construct a ‘FixedTrial’ object with the parameter set.
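To make this deployment workflow concrete, the following is a minimal sketch of our own (not a listing from the paper) that passes a ‘FixedTrial’ to an unmodified objective function; the toy quadratic objective is only an illustrative assumption.

import optuna

def objective(trial):
    # The objective still calls trial.suggest_*; a FixedTrial simply returns
    # the user-supplied values instead of sampling new ones.
    x = trial.suggest_uniform('x', -10, 10)
    y = trial.suggest_uniform('y', -10, 10)
    return (x - 2) ** 2 + (y + 3) ** 2

# Re-evaluate the objective under a chosen parameter set, e.g. the best one.
fixed = optuna.trial.FixedTrial({'x': 2.0, 'y': -3.0})
print(objective(fixed))  # -> 0.0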
2.3 Historical Remarks

Historically, the term define-by-run was coined by the developers of deep learning frameworks. In the beginning, most deep learning frameworks like Theano and Torch used to be declarative, and constructed the networks in their domain specific languages (DSL). These frameworks are called define-and-run frameworks because they do not allow the user to alter the manipulation of intermediate variables once the network is defined. In define-and-run frameworks, computation is conducted in two phases: (1) the construction phase and (2) the evaluation phase. In a way, contemporary optimization methods like Hyperopt are built on a philosophy similar to define-and-run, because there are two phases in their optimization: (1) construction of the search space and (2) exploration in the search space.

Because of their difficulty of programming, define-and-run style deep learning frameworks are quickly being replaced by define-by-run style deep learning frameworks like Chainer [13], DyNet [14], PyTorch [15], eager-mode TensorFlow [16], and Gluon. In a define-by-run style DL framework, there are no two separate phases for the construction of the network and the computation on the network. Instead, the user is allowed to directly program how each variable is to be manipulated in the network. What we propose in this article is an analogue of the define-by-run DL framework for hyperparameter optimization, in which the framework asks the user to directly program the parameter search space (see Table 1). Armed with the architecture built on the define-by-run principle, our Optuna can express highly sophisticated search spaces with ease.
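To make concrete what it means to program the search space directly, the following sketch (our illustration, not Figure 3 from the paper) constructs a conditional search space with ordinary Python control flow; the parameter names and the placeholder scores are assumptions made for this example only.

import optuna

def objective(trial):
    # The search space is built dynamically: which parameters exist depends
    # on the values suggested earlier within the same trial.
    classifier = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if classifier == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        score = 1.0 / (1.0 + svc_c)   # placeholder for a real validation score
    else:
        max_depth = trial.suggest_int('rf_max_depth', 2, 32)
        score = max_depth / 32.0      # placeholder for a real validation score
    return 1.0 - score

study = optuna.create_study()
study.optimize(objective, n_trials=50)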
3 Efficient Sampling and Pruning Mechanism

In general, the cost-effectiveness of a hyperparameter optimization framework is determined by the efficiency of (1) the searching strategy that determines the set of parameters that shall be investigated, and (2) the performance estimation strategy that estimates the value of currently investigated parameters from learning curves and determines the set of parameters that shall be discarded. As we will experimentally show later, the efficiency of both the searching strategy and the performance estimation strategy is necessary for a cost-effective optimization method.

The strategy for the termination of unpromising trials is often referred to as pruning in the literature, and it is also well known as automated early stopping [5, 7]. We, however, refer to this functionality as pruning in order to distinguish it from early stopping regularization in machine learning, which exists as a countermeasure against overfitting. As shown in Table 2, many existing frameworks do not provide efficient pruning strategies. In this section we will provide our design for both sampling and pruning.

3.1 Sampling Methods on Dynamically Constructed Parameter Space

There are generally two types of sampling methods: relational sampling, which exploits the correlations among the parameters, and independent sampling, which samples each parameter independently. Independent sampling is not necessarily a naive option, because some sampling algorithms like TPE [6] are known to perform well even without using the parameter correlations, and the cost-effectiveness of relational and independent sampling depends on the environment and the task. Our Optuna features both, and it can handle various independent sampling methods including TPE as well as relational sampling methods like CMA-ES. However, some words of caution are in order for the implementation of relational sampling in a define-by-run framework.

Relational sampling in define-by-run frameworks

One valid claim about the advantage of the old define-and-run optimization design is that the program is given the knowledge of the concurrence relations among the hyperparameters from the beginning of the optimization process. Implementing optimization methods that take the concurrence relations among the parameters into account is a nontrivial challenge when the search spaces are dynamically constructed. To overcome this challenge, Optuna features an ability to identify trial results that are informative about the concurrence relations. This way, the framework can identify the underlying concurrence relations after some number of independent samplings, and use the inferred concurrence relations to conduct user-selected relational sampling algorithms like CMA-ES [17] and GP-BO [18]. Being an open source software, Optuna also allows the user to use his/her own customized sampling procedure.
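As a usage sketch (ours, not a listing from the paper), the sampler that drives a study can be chosen explicitly when the study is created; relational or user-defined samplers are passed in the same way, although the available class names vary across Optuna versions.

import optuna

def objective(trial):
    x = trial.suggest_uniform('x', -10, 10)
    return (x - 2) ** 2

# Explicitly select the built-in TPE sampler for this study; a relational or
# custom sampler object can be passed through the same `sampler` argument.
study = optuna.create_study(sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)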
3.2 Efficient Pruning Algorithm

The pruning algorithm is essential in ensuring the "cost" part of cost-effectiveness. A pruning mechanism in general works in two phases. It (1) periodically monitors the intermediate objective values, and (2) terminates a trial that does not meet a predefined condition. In Optuna, the ‘report’ API is responsible for the monitoring functionality, and the ‘should_prune’ API is responsible for the premature termination of unpromising trials (see Figure 5). The background algorithm of the ‘should_prune’ method is implemented by the family of pruner classes. Optuna features a variant of the Asynchronous Successive Halving algorithm [19], a recently developed state-of-the-art method that scales linearly with the number of workers in a distributed environment.
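Below is a minimal sketch of our own (not a listing from the paper) showing how such a pruner is attached to a study; we assume the Successive Halving-based pruner exposed as ‘SuccessiveHalvingPruner’ in the Optuna releases we are aware of.

import optuna

# The pruner configured here is the one consulted by trial.should_prune()
# inside an objective such as the one in Figure 5.
study = optuna.create_study(pruner=optuna.pruners.SuccessiveHalvingPruner())
# study.optimize(objective) would then run trials with asynchronous pruning.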
Asynchronous Successive Halving (ASHA) is an extension of Successive Halving [20] in which each worker is allowed to asynchronously execute aggressive early stopping based on a provisional ranking of trials. The most prominent advantage of asynchronous pruning is that it is particularly well suited for application in distributed environments; because each worker does not have to wait for the results from other workers at each
import ...

def objective(trial):
    ...
    lr = trial.suggest_loguniform('lr', 1e-5, 1e-1)
    clf = sklearn.linear_model.SGDClassifier(learning_rate=lr)
    for step in range(100):
        clf.partial_fit(x_train, y_train, classes)

        # Report intermediate objective value.
        intermediate_value = clf.score(x_val, y_val)
        trial.report(intermediate_value, step=step)

        # Handle pruning based on the intermediate value.
        if trial.should_prune(step):
            raise TrialPruned()

    return 1.0 - clf.score(x_val, y_val)

study = optuna.create_study()
study.optimize(objective)

Figure 5: An example implementation of a pruning algorithm with Optuna. An intermediate value is reported at each step of iterative training. The Pruner class stops unpromising trials based on the history of reported values.

Figure 6: Overview of Optuna's system design. Each worker executes one instance of an objective function in each study. The objective function runs its trial using Optuna APIs. When the API is invoked, the objective function accesses the shared storage and obtains the information of the past studies from the storage when necessary. Each worker runs the objective function independently and shares the progress of the current study via the storage.
import ...

def objective(trial):
    ...
    return objective_value

study_name = sys.argv[1]
storage = sys.argv[2]

study = optuna.Study(study_name, storage)
study.optimize(objective)

(b) Shell

Figure 7: Distributed optimization in Optuna. Figure (a) is the optimization script executed by one worker. Figure (b) is an example shell for the optimization with multiple workers in a distributed environment.
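The shell listing for panel (b) did not survive the extraction above. Purely as an illustration of the intended pattern (not the paper's listing), a launcher can start several copies of the script in panel (a) against a shared storage; the script name, study name, and storage URL below are placeholders.

import subprocess

# Hypothetical launcher: start N copies of the worker script from Figure 7 (a),
# all attached to the same study through a shared storage URL.
STUDY_NAME = "distributed-example"     # placeholder study name
STORAGE_URL = "sqlite:///example.db"   # placeholder; any RDB URL works the same way
N_WORKERS = 4

workers = [
    subprocess.Popen(["python", "optimize.py", STUDY_NAME, STORAGE_URL])
    for _ in range(N_WORKERS)
]
for w in workers:
    w.wait()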
Figure 8: Optuna dashboard. This example shows the online transition of objective values, the parallel coordinates plot of sampled parameters, the learning curves, and the tabular descriptions of investigated trials.

Optuna's new design thus significantly reduces the effort required for storage deployment. This new design can be easily incorporated into a container-orchestration system like Kubernetes as well. As we verify in the experiment section, the distributed computations conducted with our flexible system design scale linearly with the number of workers. Optuna is also an open-source software that can be installed to the user's system with a single command.
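As a usage sketch of our own (not the paper's listing), pointing a study at a relational-database URL is all that is needed to make its history shareable across processes; the SQLite file below is an illustrative choice, and server-backed databases can be specified through the same argument.

import optuna

# A study backed by a relational database; other workers can attach to the
# same study by using the same study name and storage URL.
study = optuna.create_study(study_name='example', storage='sqlite:///example.db')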
5 Experimental Evaluation

We demonstrate the efficiency of the new design framework through three sets of experiments.

5.1 Performance Evaluation Using a Collection of Tests

As described in the previous section, Optuna not only allows the user to use his/her own customized sampling procedure that suits the purpose, but also comes with multiple built-in optimization algorithms, including the mixture of independent and relational sampling, which is not featured in currently existing frameworks. For example, Optuna can use the mixture of TPE and CMA-ES. We compared the optimization performance of TPE+CMA-ES against those of other sampling algorithms on a collection of tests for black-box optimization [23, 24], which contains 56 test cases. We implemented four adversaries to compare against TPE+CMA-ES: random search as a baseline method, Hyperopt [1] as a TPE-based method, SMAC3 [3] as a random-forest-based method, and GPyOpt as a Gaussian-Process-based method. For TPE+CMA-ES, we used TPE for the first 40 steps and used CMA-ES for the rest. For the evaluation metric, we used the best-attained objective value found in 80 trials. Following the work of Dewancker et al. [24], we repeated each study 30 times for each algorithm and applied the Paired Mann-Whitney U test with α = 0.0005 to the results in order to statistically compare TPE+CMA-ES's performance against the rival algorithms.

The results are shown in Figure 9. TPE+CMA-ES finds a statistically worse solution than random search in only 1/56 test cases, performs worse than Hyperopt in 1/56 cases, and performs worse than SMAC3 in 3/56 cases. Meanwhile, GPyOpt performed better than TPE+CMA-ES in 34/56 cases in terms of the best-attained loss value. At the same time, TPE+CMA-ES takes an order of magnitude less time per trial than GPyOpt.

Figure 10 shows the average time spent for each test case. TPE+CMA-ES, Hyperopt, SMAC3, and random search finished one study within a few seconds even for the test cases with more than ten design variables. On the other hand, GPyOpt required a twenty times longer duration to complete a study. We see that the mixture of TPE and CMA-ES is a cost-effective choice among the current lines of advanced optimization algorithms. If the time of evaluation is a bottleneck, the user may use a Gaussian-Process-based method as a sampling algorithm. We plan in the near future to also develop an interface on which the user of Optuna can easily deploy external optimization software as well.
Figure 12: Distributed hyperparameter optimization process for the minimization of average test errors of simplified AlexNet for the SVHN dataset. The optimization was done with ASHA pruning.

…of size 10KB each, and used Optuna to look for a parameter-set that minimizes the computation time required for applying a certain set of operations (store, search, delete) to this file set. Out of over a hundred customizable parameters, we used Optuna to explore the space of 34 parameters. With the default parameter setting, RocksDB takes 372 seconds on HDD to apply the set of operations to the file set. With pruning, Optuna was able to find a parameter-set that reduces the computation time to 30 seconds. Within the same 4 hours, the algorithm with pruning explores 937 sets of parameters while the algorithm without pruning only explores 39. When we disable the time-out option for the evaluation process, the algorithm without pruning explores only 2 trials. This experiment again verifies the crucial role of pruning.

Encoder Parameters for FFmpeg. FFmpeg (https://www.ffmpeg.org/) is a multimedia framework that is widely used for decoding, encoding, and streaming of movies and audio datasets. FFmpeg has numerous customizable parameters for encoding. However, finding a good encoding parameter-set for FFmpeg is a nontrivial task, as it requires expert knowledge of codecs. We used Optuna to seek the encoding parameter-set that minimizes the reconstruction error for the Blender Open Movie Project's "Big…

8 Blender Foundation (www.blender.org)

7 Conclusions

The efficacy of Optuna strongly supports our claim that our new design criteria for next-generation optimization frameworks are worth adopting in the development of future frameworks. The define-by-run principle enables the user to dynamically construct the search space in a way that has never been possible with previous hyperparameter tuning frameworks. The combination of efficient searching and pruning algorithms greatly improves the cost-effectiveness of optimization. Finally, the scalable and versatile design allows users of various types to deploy the framework for a wide variety of purposes. As an open-source software, Optuna itself can also evolve even further as next-generation software by interacting with the open source community. It is our strong hope that the set of design techniques we developed for Optuna will serve as a basis of other next-generation optimization frameworks to be developed in the future.

Acknowledgement. The authors thank R. Calland, S. Tokui, H. Maruyama, K. Fukuda, K. Nakago, M. Yoshikawa, M. Abe, H. Imamura, and Y. Kitamura for valuable feedback and suggestions.

References

[1] James Bergstra, Brent Komer, Chris Eliasmith, Dan Yamins, and David D Cox. Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1):14008, 2015.

[2] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.

[3] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION, pages 507–523, 2011. ISBN 978-3-642-25565-6.
[4] Patrick Koch, Oleg Golovidov, Steven Gardner, Brett Wujek, Joshua Griffin, and Yan Xu. Autotune: A derivative-free optimization framework for hyperparameter tuning. In KDD, pages 443–452, 2018. ISBN 978-1-4503-5552-0.

[5] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley. Google Vizier: A service for black-box optimization. In KDD, pages 1487–1495, 2017. ISBN 978-1-4503-4887-4.

[6] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In NIPS, pages 2546–2554, 2011.

[7] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E. Gonzalez, and Ion Stoica. Tune: A research platform for distributed model selection and training. In ICML Workshop on AutoML, 2018.

[8] Tobias Domhan, Jost Tobias Springenberg, and Frank Hutter. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, pages 3460–3468, 2015.

[9] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with Bayesian neural networks. In ICLR, 2017.

[10] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

[11] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. CoRR, abs/1712.05889, 2017. URL http://arxiv.org/abs/1712.05889.

[12] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, editors. Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at http://automl.org/book.

[13] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In NIPS Workshop on Machine Learning Systems, 2015.

[14] Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. DyNet: The dynamic neural network toolkit. CoRR, abs/1701.03980, 2017.

[15] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In OSDI, pages 265–283, 2016.

[17] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.

[18] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[19] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. Massively parallel hyperparameter tuning. In NeurIPS Workshop on Machine Learning Systems, 2018.

[20] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016.

[21] Wes McKinney. pandas: a foundational Python library for data analysis and statistics. In SC Workshop on Python for High Performance and Scientific Computing, 2011.

[22] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing. Jupyter notebooks: a publishing format for reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90. IOS Press, 2016.

[23] Michael McCourt. Benchmark suite of test functions suitable for evaluating black-box optimization strategies. https://github.com/sigopt/evalset, 2016.

[24] Ian Dewancker, Michael McCourt, Scott Clark, Patrick Hayes, Alexandra Johnson, and George Ke. A strategy for ranking optimization methods using multiple criteria. In ICML Workshop on AutoML, pages 11–20, 2016.

[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[26] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[27] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018.

[28] Takuya Akiba, Tommi Kerola, Yusuke Niitani, Toru Ogawa, Shotaro Sano, and Shuji Suzuki. PFDet: 2nd