
Hyperparameters in Reinforcement Learning and How To Tune Them

Theresa Eimer 1 * Marius Lindauer 1 Roberta Raileanu 2

* Work was done during an internship at Meta AI. 1 Leibniz University Hannover, 2 Meta AI. Correspondence to: Theresa Eimer <[email protected]>. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

In order to improve reproducibility, deep reinforcement learning (RL) has been adopting better scientific practices such as standardized evaluation metrics and reporting. However, the process of hyperparameter optimization still varies widely across papers, which makes it challenging to compare RL algorithms fairly. In this paper, we show that hyperparameter choices in RL can significantly affect the agent's final performance and sample efficiency, and that the hyperparameter landscape can strongly depend on the tuning seed, which may lead to overfitting. We therefore propose adopting established best practices from AutoML, such as the separation of tuning and testing seeds, as well as principled hyperparameter optimization (HPO) across a broad search space. We support this by comparing multiple state-of-the-art HPO tools on a range of RL algorithms and environments to their hand-tuned counterparts, demonstrating that HPO approaches often have higher performance and lower compute overhead. As a result of our findings, we recommend a set of best practices for the RL community, which should result in stronger empirical results with fewer computational costs, better reproducibility, and thus faster progress. In order to encourage the adoption of these practices, we provide plug-and-play implementations of the tuning algorithms used in this paper at https://github.com/facebookresearch/how-to-autorl.

Figure 1: Comparison of Hyperparameter Tuning Approaches: state-of-the-art hyperparameter optimization packages match or outperform hand tuning via grid search, while using less than 1/12 of the budget.

1. Introduction

Deep reinforcement learning (RL) algorithms contain a number of design decisions and hyperparameter settings, many of which have a critical influence on the learning speed and success of the algorithm. While design decisions and implementation details have received greater attention in the last years (Henderson et al., 2018; Engstrom et al., 2020; Hsu et al., 2020; Andrychowicz et al., 2021; Obando-Ceron & Castro, 2021), the same is less true of RL hyperparameters. Progress in self-adapting algorithms (Zahavy et al., 2020), RL-specific hyperparameter optimization tools (Franke et al., 2020; Wan et al., 2022), and meta-learnt hyperparameters (Flennerhag et al., 2022) has not yet been adopted by RL practitioners. In fact, most papers only report final model hyperparameters or grid search sweeps known to be suboptimal and costly compared to even simple Hyperparameter Optimization (HPO) baselines like random search (Bergstra & Bengio, 2012). In addition, the seeds used for tuning and evaluation are rarely reported, leaving it unclear if the hyperparameters were tuned on the test seeds, which is – as we will show – a major reproducibility issue. In this paper, we aim to lay out and address the potential causes for the lack of adoption of HPO methods in the RL community.

Underestimation of Hyperparameter Influence While it has been previously shown that hyperparameters are important to an RL algorithm's success (Henderson et al., 2018; Engstrom et al., 2020; Andrychowicz et al., 2021), the impact of even seemingly irrelevant hyperparameters is still underestimated by the community, as indicated by the fact that many papers tune only two or three hyperparameters (Schulman et al., 2017; Berner et al., 2019; Badia et al., 2020; Hambro et al., 2022).
We show that even often overlooked hyperparameters can make or break an algorithm's success, meaning that careful consideration is necessary for a broad range of hyperparameters. This is especially important for as-of-yet unexplored domains, as pointed out by Zhang et al. (2021a). Furthermore, hyperparameters cause different algorithm behaviors depending on the random seed, which is a well-known fact in AutoML (Eggensperger et al., 2019; Lindauer & Hutter, 2020) but has not yet factored widely into RL research, negatively impacting reproducibility.

Fractured State of the Art in AutoRL Even though HPO approaches have succeeded in tuning RL algorithms (Franke et al., 2021; Awad et al., 2021; Zhang et al., 2021a; Wan et al., 2022), the costs and benefits of HPO are relatively unknown in the community. AutoRL papers often compare only a few HPO methods, are limited to single domains or toy problems, or use a single RL algorithm (Jaderberg et al., 2017; Parker-Holder et al., 2020; Awad et al., 2021; Kiran & Ozyildirim, 2022; Wan et al., 2022). In this work, we aim to understand the need for and challenges of AutoRL by comparing multiple HPO methods across various state-of-the-art RL algorithms on challenging environments. Our results demonstrate that HPO approaches have better performance and less compute overhead than hyperparameter sweeps or grid searches which are typically used in the RL community (see Figure 1).

Ease of Use State-of-the-art AutoML tools are often released as research papers rather than standalone packages. In addition, they are not immediately compatible with standard RL code, while easy-to-use solutions like Optuna (Akiba et al., 2019) or Ax (Bakshy et al., 2018) only provide a limited selection of HPO approaches. To improve the availability of these tools, we provide Hydra sweepers (Yadan, 2019) for several variations of population-based methods, such as standard PBT (Jaderberg et al., 2017), PB2 (Parker-Holder et al., 2020) and BGT (Wan et al., 2022), as well as the evolutionary algorithm DEHB (Awad et al., 2021). Note that all of these have been shown to improve over random search for tuning RL agents. As black-box methods, they are compatible with any RL algorithm or environment and, due to Hydra, users do not have to change their implementation besides returning a success metric like the reward once training is finished. Based on our empirical insights, we provide best practice guidelines on how to use HPO for RL.
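To illustrate the kind of integration described here, the sketch below shows a hypothetical Hydra-based training entry point: the hyperparameters come from the Hydra config, and the function simply returns the final evaluation reward for a sweeper to optimize. The config name, fields and environment are illustrative assumptions, not the exact interface of the how-to-autorl sweepers:

    # Hypothetical Hydra entry point for an RL training run. A sweeper varies the
    # hyperparameters in the config and reads the returned evaluation reward.
    # Config name and fields are illustrative assumptions.
    import hydra
    from omegaconf import DictConfig
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy


    @hydra.main(config_path="configs", config_name="ppo_acrobot", version_base=None)
    def train(cfg: DictConfig) -> float:
        model = PPO(
            "MlpPolicy",
            cfg.env_id,                      # e.g. "Acrobot-v1"
            learning_rate=cfg.learning_rate,
            ent_coef=cfg.ent_coef,
            clip_range=cfg.clip_range,
            seed=cfg.seed,
            verbose=0,
        )
        model.learn(total_timesteps=cfg.total_timesteps)
        mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
        return mean_reward  # the success metric consumed by the sweeper


    if __name__ == "__main__":
        train()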
In this paper, we demonstrate that compared to tuning hyperparameters by hand, existing HPO tools are capable of producing better performing, more stable, and more easily comparable RL agents, while using fewer computational resources. We believe widespread adoption of HPO protocols within the RL community will therefore result in more accurate and fair comparisons across RL methods and in the end to faster progress.

To summarize, our contributions are:

1. Exploration of the hyperparameter landscape for commonly-used RL algorithms and environments;
2. Comparison of different types of HPO methods on state-of-the-art RL algorithms and challenging RL environments;
3. Open-source implementations of advanced HPO methods that can easily be used with any RL algorithm and environment; and
4. Best practice recommendations for HPO in RL.

2. The Hyperparameter Optimization Problem

We provide an overview of the most relevant formalizations of HPO in RL: Algorithm Configuration (Schede et al., 2022) and Dynamic Algorithm Configuration (Adriaensen et al., 2022). Algorithm Configuration (AC) is a popular paradigm for optimizing hyperparameters of several different kinds of algorithms (Eggensperger et al., 2019).

Definition 2.1 (AC). Given an algorithm A, a hyperparameter space Λ, as well as a distribution of environments or environment instances I, and a cost function c, find the optimal configuration λ* ∈ Λ across possible tasks s.t.:

λ* ∈ arg min_{λ∈Λ} E_{i∼I} [c(A(i; λ))].

The cost function could be the negative of the agent's reward or a failure indicator across a distribution of tasks. Thus it is quite flexible and can accommodate a diverse set of possible goals for algorithm performance. This definition is not restricted to one train and test setting but aims to achieve the best possible performance across a range of environments or environment instances. AC approaches thus strive to avoid overfitting the hyperparameters to a specific scenario. Even for RL problems focusing on generalization, AC is therefore a suitable framework. Commonly, the HPO process is terminated before we have found the true λ* via an optimization budget (e.g. the runtime or number of training steps). The best hyperparameter configuration found by the optimization process is called the incumbent.
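As a minimal, library-free illustration of Definition 2.1, the sketch below approximates E_{i∼I}[c(A(i; λ))] by a mean over sampled environment/seed instances and tracks the best configuration seen so far as the incumbent; train_and_evaluate is a stand-in we introduce for a full RL training run, not part of any package discussed in this paper:

    import random
    import statistics

    # Sketch of the AC objective (Definition 2.1): search for the configuration that
    # minimizes the expected cost over a distribution of instances (env/seed pairs).

    SEARCH_SPACE = {
        "learning_rate": lambda: 10 ** random.uniform(-5, -2),  # log-uniform
        "gamma": lambda: random.uniform(0.9, 0.9999),
        "clip_range": lambda: random.uniform(0.1, 0.4),
    }


    def sample_configuration():
        return {name: sample() for name, sample in SEARCH_SPACE.items()}


    def train_and_evaluate(config, instance):
        # Stand-in for a full RL run returning the cost c(A(i; lambda)),
        # e.g. the negative mean evaluation reward of the trained agent.
        return (config["learning_rate"] - 3e-4) ** 2 + random.gauss(0.0, 1e-8)


    def random_search(instances, budget=10):
        incumbent, incumbent_cost = None, float("inf")
        for _ in range(budget):
            config = sample_configuration()
            # Approximate the expectation over I by the mean over the tuning instances.
            cost = statistics.mean(train_and_evaluate(config, i) for i in instances)
            if cost < incumbent_cost:
                incumbent, incumbent_cost = config, cost
        return incumbent, incumbent_cost


    if __name__ == "__main__":
        tuning_instances = [("Acrobot-v1", seed) for seed in (0, 1, 2)]
        print(random_search(tuning_instances, budget=10))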
Another relevant paradigm for tuning RL is Dynamic Algorithm Configuration (DAC) (Biedenkapp et al., 2020; Adriaensen et al., 2022). DAC is a generalization of AC that does not search for a single optimal hyperparameter value per algorithm run but instead for a sequence of values.

Definition 2.2 (DAC). Given an algorithm A, a hyperparameter space Λ, as well as a distribution of environments or environment instances I with state space S, a cost function c, and a space of dynamic configuration policies Π with each π ∈ Π : S × I → Λ, find π* ∈ Π s.t.:

π* ∈ arg min_{π∈Π} E_{i∼I} [c(A(i; π))].

As RL is a dynamic optimization process, it can benefit from dynamic changes in the hyperparameter values such as learning rate schedules (Zhang et al., 2021a; Parker-Holder et al., 2022). Thus HPO tools developed specifically for RL have been following the DAC paradigm in order to tailor the hyperparameter values closer to the training progress (Franke et al., 2020; Zhang et al., 2021a; Wan et al., 2022).
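To make the contrast with a static configuration concrete, here is a deliberately simple sketch of a dynamic configuration policy in the spirit of Definition 2.2: between training segments, the policy inspects the observed training state and emits the next hyperparameter values, producing a schedule rather than a single setting. The rules and the run_training_segment placeholder are illustrative assumptions, not an implementation of any specific DAC tool:

    # Sketch of a dynamic configuration policy pi: training state -> hyperparameters.
    # `run_training_segment` stands in for e.g. a few thousand steps of PPO training
    # and is assumed to return the mean return achieved during that segment.

    def configuration_policy(state):
        # Toy rule: halve the learning rate whenever improvement stalls,
        # and anneal the entropy coefficient with training progress.
        lr = state["learning_rate"]
        if state["improvement"] < 0.01:
            lr *= 0.5
        ent_coef = 0.01 * (1.0 - state["progress"])
        return {"learning_rate": lr, "ent_coef": ent_coef}


    def dynamic_tuning_run(run_training_segment, segments=10):
        config = {"learning_rate": 3e-4, "ent_coef": 0.01}
        state = {"learning_rate": config["learning_rate"], "improvement": 1.0, "progress": 0.0}
        last_return, schedule = float("-inf"), []
        for t in range(segments):
            mean_return = run_training_segment(config)   # train with the current values
            state["improvement"] = mean_return - last_return
            state["progress"] = (t + 1) / segments
            last_return = mean_return
            config = configuration_policy(state)          # pick the next values
            state["learning_rate"] = config["learning_rate"]
            schedule.append(dict(config))
        return schedule                                    # the discovered schedule


    if __name__ == "__main__":
        dummy_segment = lambda cfg: 100.0 * cfg["learning_rate"]  # placeholder training
        print(dynamic_tuning_run(dummy_segment, segments=5))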
It is worth noting that while the model architecture can be defined by a set of hyperparameters like the number of layers, architecture search is generally more complex and thus separately defined as the NAS problem or combined with HPO to form the general AutoDL problem (Zimmer et al., 2021). While some tools include options for optimizing architecture hyperparameters, insights into how to find good architectures for RL are out of scope for this paper.

3. Related Work

While RL as a field has seen many innovations in the last years, small changes to the algorithm or its implementation can have a big impact on its results (Henderson et al., 2018; Andrychowicz et al., 2021; Engstrom et al., 2020). In an effort to consolidate these innovations, several papers have examined the effect of smaller design decisions like the loss function or policy regularization for on-policy algorithms (Hsu et al., 2020; Andrychowicz et al., 2021), DQN (Obando-Ceron & Castro, 2021) and offline RL (Zhang & Jiang, 2021). AutoRL methods, on the other hand, have focused on automating and abstracting some of these decisions (Parker-Holder et al., 2022) by using data-driven approaches to learn various algorithmic components (Bechtle et al., 2020; Xu et al., 2020; Metz et al., 2022) or even entire RL algorithms (Wang et al., 2016; Duan et al., 2016; Co-Reyes et al., 2021; Lu et al., 2022).

While overall there has been less interest in hyperparameter optimization, some RL-specific HPO algorithms have been developed. STACX (Zahavy et al., 2020) is an example of a self-tuning algorithm, using meta-gradients (Xu et al., 2018) to optimize its hyperparameters during runtime. This idea has recently been generalized to bootstrapped meta-learning, enabling the use of meta-gradients to learn any combination of hyperparameters on most RL algorithms on the fly (Flennerhag et al., 2022). Such gradient-based approaches are fairly general and have shown a lot of promise (Paul et al., 2019). However, they require access to the algorithm's gradients, thus limiting their use and incurring a larger compute overhead. In this paper, we focus on purely black-box methods for their ease of use in any RL setting.

Extensions of population-based training (PBT) (Jaderberg et al., 2017; Li et al., 2019), such as BO kernels (Parker-Holder et al., 2020) or added NAS components (Franke et al., 2020; Wan et al., 2022), have led to significant performance and efficiency gains, offering an RL-specific way of optimizing hyperparameters during training. A benefit of PBT methods is that they implicitly find a schedule of hyperparameter settings instead of a fixed value.
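The exploit-and-explore loop at the heart of PBT can be summarized in a few lines: train a population for a short interval, let the weakest members copy the hyperparameters (and, in a real implementation, the weights) of the strongest, then perturb the copies. The sketch below is a minimal toy rendition of that idea, not the implementation of Jaderberg et al. (2017):

    import random

    # Minimal population-based training (PBT) sketch: exploit-and-explore on a toy objective.

    def toy_score(member):
        # Stand-in for the evaluation reward after one training interval.
        return -abs(member["hyperparams"]["learning_rate"] - 3e-4) + random.gauss(0, 1e-5)


    def pbt(population_size=8, generations=20):
        population = [
            {"hyperparams": {"learning_rate": 10 ** random.uniform(-5, -2)}, "score": None}
            for _ in range(population_size)
        ]
        for _ in range(generations):
            for member in population:
                member["score"] = toy_score(member)        # "train" + evaluate
            population.sort(key=lambda m: m["score"], reverse=True)
            n_exploit = max(1, population_size // 4)
            top, bottom = population[:n_exploit], population[-n_exploit:]
            for weak, strong in zip(bottom, top):
                # Exploit: copy the hyperparameters (and, in real PBT, the weights).
                weak["hyperparams"] = dict(strong["hyperparams"])
                # Explore: perturb the copied hyperparameters.
                weak["hyperparams"]["learning_rate"] *= random.choice([0.8, 1.2])
        return max(population, key=lambda m: m["score"])


    if __name__ == "__main__":
        print(pbt()["hyperparams"])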
Beyond PBT methods, many general AC algorithms have proven to perform well on ML and RL tasks (Schede et al., 2022). A few such examples are SMAC (Lindauer et al., 2022) and DEHB (Awad et al., 2021), which are based on Bayesian Optimization and evolutionary algorithms, respectively. SMAC is model-based (i.e. it learns a model of the hyperparameter landscape using a Gaussian process) and both are multi-fidelity methods (i.e. they utilize shorter training runs to test many different configurations, only progressing the best ones). While these algorithms have rarely been used in RL so far, there is no evidence to suggest they perform any worse than RL-specific optimization approaches. In fact, a possible advantage of multi-fidelity approaches over population-based ones is that given the same budget, multi-fidelity methods see a larger number of total configurations, while population-based ones see a smaller number of configurations trained for a longer time.
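The successive-halving idea underlying such multi-fidelity methods can be sketched generically: evaluate many configurations on a small training budget, keep only the best fraction, and re-evaluate the survivors with a larger budget. The evaluate function below is a user-supplied placeholder; this is an illustration of the principle, not the DEHB or SMAC code:

    import random

    # Successive halving: the backbone of multi-fidelity HPO. Many configurations get a
    # cheap, low-budget evaluation; only the best fraction is promoted to higher budgets.

    def successive_halving(sample_config, evaluate, n_configs=27, min_budget=1, eta=3):
        configs = [sample_config() for _ in range(n_configs)]
        budget = min_budget
        while len(configs) > 1:
            # Evaluate every surviving configuration with the current (partial-run) budget.
            scored = [(evaluate(c, budget), c) for c in configs]
            scored.sort(key=lambda pair: pair[0])          # lower cost is better
            configs = [c for _, c in scored[: max(1, len(configs) // eta)]]
            budget *= eta                                   # promoted configs get more budget
        return configs[0]


    if __name__ == "__main__":
        sample = lambda: {"learning_rate": 10 ** random.uniform(-5, -2)}
        # Toy cost: noisier at low budgets, as a stand-in for a partial RL training run.
        cost = lambda c, b: abs(c["learning_rate"] - 3e-4) + random.gauss(0, 0.01 / b)
        print(successive_halving(sample, cost))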
4. The Hyperparameter Landscape of RL

Before comparing HPO algorithms, we empirically motivate why using dedicated tuning tools is important in RL. To this end we study the effect of hyperparameters as well as that of the random seed on the final performance of RL algorithms. We also investigate the smoothness of the hyperparameter space. The goal of this section is not to achieve the best possible results on each task but to gather insights into how hyperparameters affect RL algorithms and how we can optimize them effectively.

Experimental Setup To gain robust insights into the impact of hyperparameters on the performance of an RL agent, we consider a range of widely-used environments and algorithms. We use basic gym environments such as OpenAI's Pendulum and Acrobot (Brockman et al., 2016), gridworlds with an exploration component such as MiniGrid's Empty and DoorKey 5x5 (Chevalier-Boisvert et al., 2018), as well as robot locomotion tasks such as Brax's Ant, Halfcheetah and Humanoid (Freeman et al., 2021). We use PPO (Schulman et al., 2017) and DQN (Mnih et al., 2015) for the discrete environments, and PPO as well as SAC (Haarnoja et al., 2018) for the continuous ones, all in their StableBaselines3 implementations (Raffin et al., 2021). This selection is representative of the main classes of model-free RL algorithms (i.e. on-policy policy-optimization, off-policy value-based, and off-policy actor-critic) and covers a diverse set of tasks posing different challenges (i.e. discrete and continuous control), allowing us to draw meaningful and generalizable conclusions.

For each environment, we sweep over 8 hyperparameters for DQN, 7 for SAC and 11 for PPO (for a full list, see Appendix E). We run each combination of hyperparameter setting, algorithm and environment for 5 different random seeds. For brevity's sake, we focus on the PPO results in the main paper. The results on the other algorithms lead to similar conclusions and can be found in Appendix H.

For the tuning insights in this section, we use random search (RS) in its Optuna implementation (Akiba et al., 2019), a multi-fidelity method called DEHB (Awad et al., 2021) and a PBT approach called PB2 (Parker-Holder et al., 2020). Although grid search is certainly more commonly-used in RL than RS, we do not include it as a baseline due to its major disadvantages relative to RS, such as its poor scaling with the size of the search space and heavy reliance on domain knowledge (Bergstra & Bengio, 2012). We choose DEHB and PB2 as two standard incarnations of multi-fidelity and PBT methods without any extensions like run initialization (Wan et al., 2022) or configuration racing (Lindauer et al., 2022) because we want to test how well lightweight vanilla versions of these algorithm classes perform on RL. We use a total budget of 10 full RL runs for all methods. For more background on these methods as well as their own hyperparameter settings, see Appendix C. A complete overview of search spaces and experiment settings can be found in Appendix D. The code for all experiments in this paper can be found at https://github.com/facebookresearch/how-to-autorl.
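As a concrete example of the kind of RS setup used in this section, the sketch below tunes a few PPO hyperparameters on Acrobot with Optuna's RandomSampler and a budget of 10 full training runs. The search ranges, timestep budget and seed are illustrative choices and do not reproduce the exact experimental settings of Appendix D:

    import optuna
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy


    def objective(trial: optuna.Trial) -> float:
        # Search space (ranges are illustrative, not the paper's exact Appendix D spaces).
        learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
        ent_coef = trial.suggest_float("ent_coef", 1e-4, 0.1, log=True)
        clip_range = trial.suggest_float("clip_range", 0.1, 0.4)
        n_epochs = trial.suggest_int("n_epochs", 3, 20)

        model = PPO(
            "MlpPolicy", "Acrobot-v1", learning_rate=learning_rate, ent_coef=ent_coef,
            clip_range=clip_range, n_epochs=n_epochs, seed=0, verbose=0,
        )
        model.learn(total_timesteps=100_000)              # one full (illustrative) RL run
        mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
        return mean_reward


    study = optuna.create_study(
        direction="maximize", sampler=optuna.samplers.RandomSampler(seed=0)
    )
    study.optimize(objective, n_trials=10)                # total budget: 10 full runs
    print("Incumbent:", study.best_params)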
4.1. Which RL Hyperparameters Should Be Tuned?

Our goal is not to find good default hyperparameter settings (see Appendix E for our reasoning) or gain insights into why some configurations perform a certain way. Instead, we are interested in their general relevance, i.e., the effect size for hyperparameter tuning. Thus, we run sweeps over our chosen hyperparameters for each environment and algorithm to get an impression of which hyperparameters are important in each setting. See Appendix G for the full results.

Figure 2: Hyperparameter landscapes of learning rate, clip range and entropy coefficient for PPO on Brax and MiniGrid. For each hyperparameter value, we report the average final return and standard deviation across 5 seeds.

In Figure 2, we can see a large influence on the final performance of almost every hyperparameter we sweep over for each environment. Even the rarely tuned clip range can be a deciding factor between an agent succeeding or failing in an environment such as in Ant. In many cases, hyperparameters can also have a large effect on the algorithm's stability throughout training. In total, we observed only the worst hyperparameter choice being within the best choice's standard deviation 7 times out of 126 settings and only 13 times the median performance dropping by less than 20%. At the same time, hyperparameter importance analysis using fANOVA (Hutter et al., 2014) shows that one or two hyperparameters monopolize the importance on each environment, though they tend to vary from environment to environment (e.g. the PPO learning rate on Acrobot, the clip range on Pendulum and the GAE lambda on MiniGrid; see Appendix I for full results). Additionally, Partial Dependency Plots on Pendulum and Acrobot (see Appendix J) show that there are almost no complex interaction patterns between the hyperparameters which would increase the difficulty when tuning all of them at the same time. Since most hyperparameters have significant influences on performance, their importance varies across environments, and there are only few interference effects, we recommend tuning as many hyperparameters as possible – as is best practice in the AutoML community (Eggensperger et al., 2019).

This result suggests that common grid search approaches are likely suboptimal as good search space coverage along many dimensions is highly expensive (Bergstra & Bengio, 2012). In order to empirically test if current HPO tools are well suited to such a set of diverse hyperparameters, we tune our algorithms using differently sized search spaces: (i) only the learning rate (which could be hand-tuned), (ii) a small space with three hyperparameters (which would be expensive but possible to tune manually) and (iii) the full search space of 7 hyperparameters for SAC, 9 for DQN, and 11 for PPO (which is too large to feasibly tune by hand – sweeping 7 hyperparameters with only three values amounts to a grid search of 2187 runs).

Table 1: Tuning PPO on Acrobot (top) and SAC on Pendulum (bottom) across different search space sizes (i.e. only learning rate, {learning rate, entropy coefficient, training epochs}, and full search space). Shown is the negative evaluation reward across 5 tuning runs. Lower numbers are better, best performance on each environment is highlighted. The best final performance on a single seed from our sweeps is also reported.

              Search Space   DEHB Inc.    PB2 Inc.     RS Inc.      Sweep
  Acrobot     LR Only        71 ± 1       94 ± 22      78 ± 5       81
              Small          72 ± 1       193 ± 160    80 ± 6
              Full           71 ± 3       305 ± 186    83 ± 5
  Pendulum    LR Only        71 ± 12      207 ± 126    89 ± 25      117
              Small          119 ± 12     106 ± 12     401 ± 363
              Full           112 ± 24     78 ± 19      144 ± 48

In Table 1 we see that RS performs well on Acrobot but it falls short on Pendulum, displaying large discrepancies across seeds, some performing well, and some failing to find a good configuration. While this is a typical failure case of RS, this does not mean RS is a weak candidate, ranking second overall by outperforming PB2 in several cases. PB2 is also quite unreliable: on Acrobot, its performance decreases with the size of the search space; on Pendulum, however, it improves with the size of the search space. As with RS, part of the underlying issue is the inconsistent performance of PB2. Note that the incumbent configuration is fairly static across all PB2 runs for the larger search spaces. In most cases, the configuration changes at most once during training, showing that PB2 currently does not take full advantage of its ability to find dynamic schedules. DEHB is the most stable in terms of standard deviation across seeds, even though we see a slight decrease in performance on Pendulum with larger search spaces.

Overall we see that finding well performing configurations across large search spaces is usually possible even with a simple algorithm like RS. All methods deliver reasonable hyperparameter configurations across a large search space, especially given all of them use only 10 full training runs. On this small budget, they are able to match or outperform the single best seeds in all our sweep runs which use a total of 125 runs per environment. Our experiments show that automatically tuning a large variety of hyperparameters is both beneficial and efficient using even simple algorithms like RS or vanilla instantiations of multi-fidelity and population-based methods.

4.2. Are Hyperparameters in RL Well Behaved?

In addition to the large number of hyperparameters contributing to an algorithm's performance, how an algorithm behaves with respect to changing hyperparameter values is an important factor in tuning algorithms. Ideally, we want the algorithm's performance to be predictable, i.e., if the hyperparameter value is close to the optimum, we want the agent to perform well and then become progressively worse the farther we move away – in essence, a smooth optimization landscape (Pushak & Hoos, 2022). As we can see in Figure 2, the transitions between different parts of the search space are fairly smooth. The configurations perform in the order we would expect them to given the best values, with the drops in performance being mostly gradual instead of sudden. Figure 3 shows configurations also performing consistently with regards to one another during the runtime, i.e., good configurations tend to learn quickly and bad configurations decay soon after training begins. This means HPO approaches utilizing partial algorithm runs to measure the quality of configurations like multi-fidelity methods or PBT should not face major issues tuning RL algorithms.

Figure 3: Hyperparameter Sweeps for PPO across learning rates, entropy coefficients and clip ranges on various environments. The mean and standard deviation are computed across 5 seeds.

While we do see large variability in some configurations, this issue seems to occur largely in medium-well performing configurations, not in the very best or worst ones (see Figure 3). This supports our claim that hyperparameters are not only useful in increasing performance but have a significant influence on algorithm variability.

Figure 4: Individual seeds for selected clip range and entropy coefficient values of PPO across various environments.

During the run itself, differences between seeds can become an issue, however, especially for methods using partial runs. On many environments, when looking at each seed individually per hyperparameter as in Figure 4, we can see the previously predictable behaviour is replaced with significant differences across seeds. We observe single seeds with crashing performance, inconsistent learning curves and also exceptionally well performing seeds that end up outperforming the best seeds of configurations which are better on average. Given that we believe tuning only a few seeds of the target RL algorithm is still the norm (Schulman et al., 2017; Berner et al., 2019; Raileanu & Rocktäschel, 2020; Badia et al., 2020; Hambro et al., 2022), such high variability with respect to the seed is likely a bigger difficulty factor for HPO in RL than the optimization landscape itself.

Thus, our conclusion is somewhat surprising: it should be possible to tune RL hyperparameters just as well as the ones in any other fields without RL-specific additions to the tuning algorithm since RL hyperparameter landscapes appear to be rather smooth. The large influence of many different hyperparameters is a potential obstacle, however, as are interaction effects that can occur between hyperparameters. Furthermore, RL's sensitivity to the random seed can present a challenge in tuning its hyperparameters, both by hand and in an automated manner.

4.3. How Do We Account for Noise?

As the variability between random seeds is a potential source of error when tuning and running RL algorithms, we investigate how we can account for it in our experiments to generate more reliable performance estimates.

As we have seen high variability both in performance and across seeds for different hyperparameter values, we return to Figure 2 to investigate how big the seed's influence on the final performance really is. The plots show that the standard deviation of the performance for the same hyperparameter configuration can be very large. While this performance spread tends to decrease for configurations with better median performance, top-performing seeds can stem from unstable configurations with low median performance (e.g. the learning rate on Humanoid). In most cases, there is an overlap between adjacent configurations, so it is certainly possible to select a presumably well-performing hyperparameter configuration on one seed that has low average performance across others.

As this is a known issue in other fields as well, albeit not to the same degree as in RL, it is common to evaluate a configuration on multiple seeds in order to achieve a more reliable estimate of the true performance (Eggensperger et al., 2018). We verify this for RL by comparing the final performance of agents tuned by DEHB and PB2 on the performance mean across a single, 3 or 5 seeds. We then test the overall best configuration on 5 unseen test seeds.
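The protocol behind Table 2 can be written down in a few lines: a configuration's tuning score is its mean performance across the tuning seeds, and only the resulting incumbent is then evaluated, and reported, on disjoint test seeds. The sketch below assumes a user-supplied train_and_eval(config, seed) function (e.g. wrapping a StableBaselines3 run) and is a protocol illustration rather than the paper's experiment code:

    import statistics

    TUNING_SEEDS = [0, 1, 2, 3, 4]
    TEST_SEEDS = [100, 101, 102, 103, 104, 105, 106, 107, 108, 109]  # disjoint from tuning seeds


    def tuning_score(train_and_eval, config):
        # Performance estimate used during HPO: mean over the tuning seeds.
        return statistics.mean(train_and_eval(config, seed) for seed in TUNING_SEEDS)


    def report(train_and_eval, candidate_configs):
        # Pick the incumbent on the tuning seeds ...
        incumbent = max(candidate_configs, key=lambda c: tuning_score(train_and_eval, c))
        # ... but report its performance on the held-out test seeds only.
        test_returns = [train_and_eval(incumbent, seed) for seed in TEST_SEEDS]
        return incumbent, statistics.mean(test_returns), statistics.stdev(test_returns)


    if __name__ == "__main__":
        dummy = lambda config, seed: -abs(config["learning_rate"] - 3e-4) * (1 + 0.1 * (seed % 3))
        configs = [{"learning_rate": lr} for lr in (1e-4, 3e-4, 1e-3)]
        print(report(dummy, configs))

Reporting only the test-seed mean and standard deviation for the incumbent (the right-hand columns in Table 2) avoids the tuning/test leakage discussed above.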
Table 2: Tuning PPO on Acrobot (top) and SAC on Pendulum (bottom) across the full search space and different numbers of seeds. Lower numbers are better, best test performance for each method and values within its standard deviation are highlighted. Test performances are aggregated across 10 separate test seeds using the mean for each tuning run. We report mean and standard deviation of these.

              Seeds      DEHB Inc.       DEHB Test        PB2 Inc.        PB2 Test         RS Inc.        RS Test
  Acrobot     1 Seed     70.6 ± 3.4      341.3 ± 183.1    305.3 ± 185.5   353.7 ± 134.5    77.8 ± 4.9     136.8 ± 70.5
              3 Seeds    76.2 ± 0.9      381.1 ± 127.6    301.2 ± 128.0   411.3 ± 117.9    88.2 ± 5.7     98.8 ± 16.3
              5 Seeds    79.3 ± 1.2      465.1 ± 24.6     228.5 ± 149.5   471.8 ± 19.1     89.2 ± 10.4    116.8 ± 43.3
              10 Seeds   156.0 ± 24.5    464.8 ± 36.5     404.9 ± 53.3    474.4 ± 23.5     108.3 ± 28.2   100.1 ± 20.0
  Pendulum    1 Seed     111.5 ± 23.6    150.5 ± 13.4     77.8 ± 19.0     840.7 ± 580.1    88.6 ± 24.9    168.3 ± 46.4
              3 Seeds    125.0 ± 23.2    144.8 ± 9.0      133.3 ± 14.7    171.0 ± 35.5     150.7 ± 13.9   159.0 ± 21.6
              5 Seeds    127.3 ± 11.5    350.2 ± 418.2    134.0 ± 22.1    661.3 ± 586.2    134.8 ± 9.8    397.8 ± 485.5
              10 Seeds   742.4 ± 498.8   318.6 ± 281.3    282.0 ± 252.9   468.6 ± 437.9    144.5 ± 17.9   150.2 ± 4.8

Table 2 shows that RS is able to improve the average test performance on both Acrobot and Pendulum by increasing the number of tuning seeds, as are DEHB and PB2 on Pendulum. However, this is only true up to a point, as performance estimation across more than 3 seeds leads to a general decrease in test performance, as well as a sharp increase in variance in some cases (e.g. 5 seed RS or 10 seed PB2 on Pendulum). Especially when tuning across 10 seeds, we see that the incumbents suffer as well, indicating that evaluating the configurations across multiple seeds increases the difficulty of the HPO problem substantially, even though it can help avoid overfitting. The performance difference between tuning and testing is significant in many cases and we can see e.g. on Acrobot that the best incumbent configurations, found by DEHB, perform more than four times worse on test seeds. We can find this effect in all tuning methods, especially on Pendulum. This presents a challenge for reproducibility given that currently it is almost impossible to know what seeds were used for tuning or evaluation. Simply reporting the performance of tuned seeds for the proposed method and that of testing seeds for the baselines is an unfair comparison which can lead to wrong conclusions.

To summarize, we have seen that the main challenges are the size of the search space, the variability involved in training RL agents, and the challenging generalization across random seeds. Since many hyperparameters have a large influence on agent performance, but the optimization landscape is relatively smooth, RL hyperparameters can be efficiently tuned using HPO techniques, as we have shown in our experiments. Manual tuning, however, is comparatively costly as its cost scales at least linearly with the size of the search space. Dedicated HPO tools, on the other hand, are able to find good configurations on a significantly smaller budget by searching the whole space. A major difficulty factor, however, is the high variability of results across seeds, which is an overlooked reproducibility issue that can lead to distorted comparisons of RL algorithms. This problem can be alleviated by tuning the algorithms on multiple seeds and evaluating them on separate test seeds.

5. Tradeoffs for Hyperparameter Optimization in Practice

While the experiments in the previous section are meant to highlight what challenges HPO tools face in RL and how well they overcome them, we now turn to more complex use cases of HPO. To this end, we select three challenging environments each from Brax (Freeman et al., 2021) (Ant, Halfcheetah and Humanoid) and three from Procgen (Cobbe et al., 2020) (Bigfish, Climber and Plunder) and automatically tune the state-of-the-art RL algorithms on these domains (PPO for Brax and IDAAC (Raileanu & Fergus, 2021) for Procgen). Our goal here is simple: we want to see if HPO tools can improve upon the state of the art in these domains with respect to both final performance and compute overhead. As we now want to compare absolute performance values on more complex problems with a bigger budget, we use BGT (Wan et al., 2022) as the state-of-the-art population-based approach, and DEHB since it is among the best solvers currently available (Eggensperger et al., 2021). As before, we use RS as an example of a simple-to-implement tuning algorithm with minimal overhead. In view of the results of Turner et al. (2021) on HPO for supervised machine learning, we expect that RS should be outperformed by the other approaches. For each task, we work on the original open-sourced code of each state-of-the-art RL method we test against, using the manually tuned hyperparameter settings as recommended in the corresponding papers as the baseline. All tuning algorithms will be given a small budget of up to 16 full algorithm runs as well as a larger one of 64 runs. In comparison, IDAAC's tuning budget is 810 runs. To give an idea of the reliability of both the tuning algorithm and the found configurations, we tune each setting 3 times across 5 seeds and test the best-found configuration on 10 unseen test seeds.

Figure 5: Tuning Results for PPO on Brax. Shown is the mean evaluation reward across 10 episodes for 3 tuning runs as well as the 98% confidence interval across tuning runs.

Figure 6: Tuning Results for IDAAC on Procgen. Shown is the mean evaluation reward across 10 episodes for 3 tuning runs as well as the 98% confidence interval across tuning runs.

As shown in Figures 5 and 6, these domains are more challenging to tune on our small budgets relative to our previous environments (for tabular results, see Appendix F). While we do not know how the Brax baseline agent was tuned as this is not reported in the paper, the IDAAC baseline uses 810 runs which is 12 times more than the large tuning budget used by our HPO methods. On Brax, DEHB outperforms the baseline with a mean rank of 1.3 compared to 1.7 for the 16 run budget and a rank of 1.0 compared to the baseline's 1.3 with 64 runs. On Procgen the comparison is similar with 1.7 to 2 for 16 runs and 1.0 to 1.3 for 64 runs (see Appendix D.4 for details on how the rank is computed). We also see that DEHB's incumbent and test scores improve the most consistently out of all the tuning methods, with the additional run budget being utilized especially well on Brax. RS, as expected, cannot match this performance, ranking 2.3 and 2.7 for 16 runs and 3.3 and 3 for 64 runs respectively. We also see poor scaling behavior in some cases, e.g. RS with a larger budget overfits to the tuning seeds on Brax while failing to improve on Procgen. As above, we see an instance of PB2 performing around 5 times worse on the test seeds compared to the incumbent on Bigfish, further suggesting that certain PBT variants may struggle to generalize in such settings. On the other environments it does better, however, earning a Procgen rank of 2 on the 16 run budget, matching the baseline.

With a budget of 64 runs, it ranks 2.7, the same as BGT and above RS. BGT does not overfit to the same degree as PB2 but performs worse on lower budgets, ranking 3.8 on Procgen for 16 runs and 2.7 for 64. On Brax, it fails to find good configurations with the exception of a single run on Ant (rank 3). We do not restart the BGT optimization after a set amount of failures, however, in order to keep within our small maximum budgets. The original paper indicates that it is likely BGT will perform much better given a less restrictive budget.

Overall, HPO tools conceived for the AC setting, as represented by DEHB, are the most consistent and reliable within our experimental setting. Random Search, while not a bad choice on smaller budgets, does not scale as well with the number of tuning runs. Population-based methods cannot match either; PB2, while finding very well performing incumbent configurations, struggles with overfitting, while BGT would likely benefit from larger budgets than used here. Further research into this optimization paradigm that prioritizes general configurations over incumbent performance could lead to additional improvements.

Across both benchmarks we see large discrepancies between the incumbent and test performance. This underlines our earlier point about the importance of using different test and tuning seeds for reporting. In terms of compute overhead, all tested HPO methods had negligible effects on the total runtime, with BGT, by far the most expensive one, utilizing on average under two minutes to produce new configurations for the 16 run budget and less than 2 hours for the 64 run budget, with all other approaches staying under 5 minutes in each budget. Overall, we see that even computationally cheap methods with small tuning budgets can generally match or outperform painstakingly hand-tuned configurations that use orders of magnitude more compute.

6. Recommendations & Best Practices

Our experiments show the benefit of comprehensive hyperparameter tuning in terms of both final performance and compute cost, as well as how common overfitting to the set of tuning seeds is. As a result of our insights, we recommend some good practices for HPO in RL going forward.

Complete Reporting We still find that many RL papers do not state how they obtain their hyperparameter configurations, if they are included at all. As we have seen, however, unbiased comparisons should not take place on the same seeds the hyperparameters are tuned on. Hence, reporting the tuning seeds, the test seeds, and the exact protocol used for hyperparameter selection should be standard practice to ensure a sound comparison across RL methods.

Adopting AutoML Standards In many ways, the AutoML community is ahead of the RL community regarding hyperparameter tuning. We can leverage this by learning from their best practices, as e.g. stated by Eggensperger et al. (2019) and Lindauer & Hutter (2020), and using their HPO tools, which can lead to strong performance as shown in this paper. One notable good practice is to use separate seeds for tuning and testing hyperparameter configurations. Other examples include standardizing the tuning budget for the baselines and proposed methods, as well as tuning on the training and not the test setting. While HPO in RL provides unique challenges such as the dynamic nature of the training loop or the strong sensitivity to the random seed, we observe significant improvements in both final performance and compute cost by employing state-of-the-art AutoML approaches. This can be done by integrating multi-fidelity evaluations into the population-based framework or using optimization tools like DEHB and SMAC.

Integrate Tuning Into The Development Pipeline For fair comparisons and realistic views of RL methods, we have to use competently tuned baselines. More specifically, the proposed method and baselines should use the same tuning budget and be evaluated on test seeds which should be different from the tuning seeds. Integrating HPO into RL codebases is a major step towards facilitating such comparisons. Some RL frameworks have started to include options for automated HPO (Huang et al., 2021; Liaw et al., 2018) or provide recommended hyperparameters for a set of environments (Raffin et al., 2021) (although usually not how they were obtained). The choice of tuning tools for each library is still relatively limited, however, while provided hyperparameters are not always well documented and typically do not transfer well to other environments or algorithms. Thus, we hope our versatile and easy-to-use HPO implementations that can be applied to any RL algorithm and environment will encourage broader use of HPO in RL (see Appendix B for more information). In the future, we hope more RL libraries include AutoRL approaches since in a closed ecosystem, more sophisticated methods that go beyond black-box optimizers (e.g. gradient-based methods, neuro-evolution, or meta-learned hyperparameter agents à la DAC) could be deployed more easily.

A Recipe For Efficient RL Research To summarize, we recommend the following step-by-step process for tuning and selecting hyperparameters in RL (a code sketch of this workflow follows the list below):

1. Define a training and test set which can include:
   (a) environment variations
   (b) random seeds for non-deterministic environments
   (c) random seeds for initial state distributions
   (d) random seeds for the agent (including network initialization)
   (e) training random seeds for the HPO tool
2. Define a configuration space with all hyperparameters that likely contribute to training success;

3. Decide which HPO method to use;
4. Define the limitations of the HPO method, i.e. the budget (or use a self-terminating method (Makarova et al., 2022));
5. Settle on a cost metric – this should be an evaluation reward across as many episodes as needed for a reliable performance estimate;
6. Run this HPO method on the training set across a number of tuning seeds;
7. Evaluate the resulting incumbent configurations on the test set across a number of separate test seeds and report the results.
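The recipe above maps onto a short script skeleton, sketched below under the assumption of a generic tune(space, cost_fn, budget) callable for whichever HPO method is chosen and a user-supplied train_and_eval(config, seed) function; all names, ranges and signatures are placeholders rather than the interface of a particular tuning package:

    import statistics

    # Step 1: disjoint seed sets for tuning and testing.
    TUNING_SEEDS = [0, 1, 2]
    TEST_SEEDS = [10, 11, 12, 13, 14]

    # Step 2: the configuration space (placeholder ranges).
    CONFIG_SPACE = {"learning_rate": (1e-5, 1e-2, "log"), "ent_coef": (1e-4, 0.1, "log")}


    def cost(config, train_and_eval):
        # Step 5: evaluation reward, averaged over the tuning seeds, negated as cost.
        return -statistics.mean(train_and_eval(config, seed) for seed in TUNING_SEEDS)


    def run_recipe(tune, train_and_eval, budget=16):
        # Step 6: run the chosen HPO method on the training seeds within the budget.
        incumbent = tune(CONFIG_SPACE, lambda c: cost(c, train_and_eval), budget=budget)
        # Step 7: evaluate the incumbent on the separate test seeds and report those numbers.
        test_returns = [train_and_eval(incumbent, seed) for seed in TEST_SEEDS]
        return incumbent, statistics.mean(test_returns), statistics.stdev(test_returns)


    if __name__ == "__main__":
        import random

        def toy_train_and_eval(config, seed):
            return -abs(config["learning_rate"] - 3e-4) * (1 + 0.1 * (seed % 3))

        def toy_tune(space, cost_fn, budget):
            candidates = [{"learning_rate": 10 ** random.uniform(-5, -2)} for _ in range(budget)]
            return min(candidates, key=cost_fn)

        print(run_recipe(toy_tune, toy_train_and_eval, budget=16))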
To ensure a fair comparison, this procedure should be followed for all RL methods used, including the baselines. If existing hyperparameters are re-used, their source and tuning protocol should be reported. In addition, their corresponding budget and search space should be the same as those of the other RL methods used for comparison. In case the budget is runtime and not e.g. a number of environment steps, it is also important to use comparable hardware for all runs. Furthermore, it is important to use the same test seeds for all configurations that are separate from all tuning seeds. If this information is not available, re-tuning the algorithm is preferred. This procedure, including all information on the search space, cost metric, HPO method settings, seeds and final hyperparameters, should be reported. We provide a checklist containing all of these points in Appendix A and as a LaTeX template in our GitHub repository.

7. Conclusion

We showed that hyperparameters in RL deserve more attention from the research community than they currently receive. Underreported tuning practices have the potential to distort algorithm evaluations while ignored hyperparameters may lead to suboptimal performance. With only small budgets, we demonstrate that HPO tools like DEHB can cover large search spaces to produce better performing configurations using fewer computational resources than hyperparameter sweeps or grid searches. We provide versatile and easy-to-use implementations of these tools which can be applied to any RL algorithm and environment. We hope this will encourage the adoption of AutoML best practices by the RL community, which should enhance the reproducibility of RL results and make solving new domains simpler.

Nevertheless, there is a lot of potential for developing HPO approaches tailored to the key challenges of RL such as the high sensitivity to the random seed for a given hyperparameter configuration. Frameworks for learnt hyperparameter policies or gradient-based optimization methods could counteract this effect by reacting dynamically to an algorithm's behaviour on a given seed. We believe this is a promising direction for future work since in our experiments, PBT methods yield fairly static configurations instead of flexible schedules. Benchmarks like the recent AutoRL-Bench (Shala et al., 2022) accelerate progress by comparing AutoRL tools without the need for RL algorithm evaluations. Lastly, higher-level AutoRL approaches that do not aim to find hyperparameter values but replace them entirely by directing the algorithm's behavior could in the long term both simplify and stabilize RL algorithms. Examples include exploration strategies (Zhang et al., 2021b), learnt optimizers (Metz et al., 2022) or entirely new algorithms (Co-Reyes et al., 2021; Lu et al., 2022).

Acknowledgements

Marius Lindauer acknowledges funding by the European Union (ERC, "ixAutoML", grant no. 101041029). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

References

Adriaensen, S., Biedenkapp, A., Shala, G., Awad, N., Eimer, T., Lindauer, M., and Hutter, F. Automated dynamic algorithm configuration. Journal of Artificial Intelligence Research, 2022.

Agarwal, R., Schwarzer, M., Castro, P., Courville, A., and Bellemare, M. Deep reinforcement learning at the edge of the statistical precipice. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS, pp. 29304–29320, 2021.

Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

Andrychowicz, M., Raichuk, A., Stanczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., and Bachem, O. What matters for on-policy deep actor-critic methods? A large-scale study. In 9th International Conference on Learning Representations, ICLR. OpenReview.net, 2021.

Awad, N., Mallik, N., and Hutter, F. DEHB: Evolutionary hyperband for scalable, robust and efficient hyperparameter optimization. In Zhou, Z. (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI, pp. 2147–2153. ijcai.org, 2021.

Badia, A., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z., and Blundell, C. Agent57: Outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119 of Proceedings of Machine Learning Research, pp. 507–517. PMLR, 2020.

Bakshy, E., Dworkin, L., Karrer, B., Kashin, K., Letham, B., Murthy, A., and Singh, S. AE: A domain-agnostic platform for adaptive experimentation. 2018.

Bechtle, S., Molchanov, A., Chebotar, Y., Grefenstette, E., Righetti, L., Sukhatme, G., and Meier, F. Meta learning via learned loss. In 25th International Conference on Pattern Recognition, ICPR, pp. 4161–4168. IEEE, 2020.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. CoRR, abs/1912.06680, 2019.

Biedenkapp, A., Bozkurt, H. F., Eimer, T., Hutter, F., and Lindauer, M. Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework. In Lang, J., Giacomo, G. D., Dilkina, B., and Milano, M. (eds.), Proceedings of the Twenty-fourth European Conference on Artificial Intelligence (ECAI'20), pp. 427–434, June 2020.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/Minigrid.

Co-Reyes, J., Miao, Y., Peng, D., Real, E., Le, Q., Levine, S., Lee, H., and Faust, A. Evolving reinforcement learning algorithms. In 9th International Conference on Learning Representations, ICLR. OpenReview.net, 2021. URL https://openreview.net/forum?id=0XXpJ4OtjW.

Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119 of Proceedings of Machine Learning Research, pp. 2048–2056. PMLR, 2020.

Duan, Y., Schulman, J., Chen, X., Bartlett, P., Sutskever, I., and Abbeel, P. RL^2: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016.

Eggensperger, K., Lindauer, M., and Hutter, F. Neural networks for predicting algorithm runtime distributions. In Lang, J. (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 1442–1448. ijcai.org, 2018.

Eggensperger, K., Lindauer, M., and Hutter, F. Pitfalls and best practices in algorithm configuration. Journal of Artificial Intelligence Research, pp. 861–893, 2019.

Eggensperger, K., Müller, P., Mallik, N., Feurer, M., Sass, R., Klein, A., Awad, N., Lindauer, M., and Hutter, F. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks, 2021.

Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep RL: A case study on PPO and TRPO. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.

Flennerhag, S., Schroecker, Y., Zahavy, T., van Hasselt, H., Silver, D., and Singh, S. Bootstrapped meta-learning. In The Tenth International Conference on Learning Representations, ICLR. OpenReview.net, 2022.

Franke, J., Köhler, G., Biedenkapp, A., and Hutter, F. Sample-efficient automated deep reinforcement learning. In 9th International Conference on Learning Representations, ICLR. OpenReview.net, 2021.

Franke, J. K., Köhler, G., Biedenkapp, A., and Hutter, F. Sample-efficient automated deep reinforcement learning. arXiv:2009.01555 [cs.LG], 2020.

Freeman, C., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. Brax - A differentiable physics engine for large scale rigid body simulation. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, 2021.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML, volume 80 of Proceedings of Machine Learning Research, pp. 1856–1865. PMLR, 2018.

Hambro, E., Raileanu, R., Rothermel, D., Mella, V., Rocktäschel, T., Küttler, H., and Murray, N. Dungeons and data: A large-scale NetHack dataset. 2022.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In McIlraith, S. and Weinberger, K. (eds.), Proceedings of the Conference on Artificial Intelligence (AAAI'18). AAAI Press, 2018.

Hsu, C., Mendler-Dünner, C., and Hardt, M. Revisiting design choices in proximal policy optimization. CoRR, abs/2009.10897, 2020.

Huang, S., Dossa, R., Ye, C., and Braga, J. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. CoRR, abs/2111.08819, 2021.

Hutter, F., Hoos, H., and Leyton-Brown, K. An efficient approach for assessing hyperparameter importance. In Xing, E. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning (ICML'14), pp. 754–762. Omnipress, 2014.

Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., and Kavukcuoglu, K. Population based training of neural networks. arXiv:1711.09846 [cs.LG], 2017.

Kiran, M. and Ozyildirim, B. Hyperparameter tuning for deep reinforcement learning applications. CoRR, abs/2201.11182, 2022. URL https://arxiv.org/abs/2201.11182.

Li, A., Spyra, O., Perel, S., Dalibard, V., Jaderberg, M., Gu, C., Budden, D., Harley, T., and Gupta, P. A generalized framework for population based training. In Teredesai, A., Kumar, V., Li, Y., Rosales, R., Terzi, E., and Karypis, G. (eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 1791–1799. ACM, 2019.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018.

Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., and Stoica, I. Tune: A research platform for distributed model selection and training. CoRR, abs/1807.05118, 2018.

Lindauer, M. and Hutter, F. Best practices for scientific research on neural architecture search. Journal of Machine Learning Research, 21:1–18, 2020.

Lindauer, M., Eggensperger, K., Feurer, M., Biedenkapp, A., Deng, D., Benjamins, C., Ruhkopf, T., Sass, R., and Hutter, F. SMAC3: A versatile bayesian optimization package for hyperparameter optimization. J. Mach. Learn. Res., 23:54:1–54:9, 2022.

Lu, C., Kuba, J., Letcher, A., Metz, L., de Witt, C., and Foerster, J. Discovered policy optimisation. CoRR, abs/2210.05639, 2022.

Makarova, A., Shen, H., Perrone, V., Klein, A., Faddoul, J., Krause, A., Seeger, M., and Archambeau, C. Automatic termination for hyperparameter optimization. In Guyon, I., Lindauer, M., van der Schaar, M., Hutter, F., and Garnett, R. (eds.), International Conference on Automated Machine Learning, AutoML, volume 188 of Proceedings of Machine Learning Research, pp. 7/1–21. PMLR, 2022.

Metz, L., Harrison, J., Freeman, C., Merchant, A., Beyer, L., Bradbury, J., Agrawal, N., Poole, B., Mordatch, I., Roberts, A., and Sohl-Dickstein, J. VeLO: Training versatile learned optimizers by scaling up. CoRR, abs/2211.09760, 2022.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Obando-Ceron, J. and Castro, P. Revisiting rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML, volume 139 of Proceedings of Machine Learning Research, pp. 1373–1383. PMLR, 2021.

Parker-Holder, J., Nguyen, V., and Roberts, S. Provably efficient online hyperparameter optimization with population-based bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.

Parker-Holder, J., Rajan, R., Song, X., Biedenkapp, A., Miao, Y., Eimer, T., Zhang, B., Nguyen, V., Calandra, R., Faust, A., Hutter, F., and Lindauer, M. Automated reinforcement learning (AutoRL): A survey and open problems. J. Artif. Intell. Res., 74:517–568, 2022.

Paul, S., Kurin, V., and Whiteson, S. Fast efficient hyperparameter tuning for policy gradient methods. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pp. 4618–4628, 2019.

Pushak, Y. and Hoos, H. H. AutoML loss landscapes. ACM Trans. Evol. Learn. Optim., 2(3):10:1–10:30, 2022.

Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res., 22:268:1–268:8, 2021.

Raileanu, R. and Fergus, R. Decoupling value and policy for generalization in reinforcement learning. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML, volume 139 of Proceedings of Machine Learning Research, pp. 8787–8798. PMLR, 2021.

Raileanu, R. and Rocktäschel, T. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.

Sass, R., Bergman, E., Biedenkapp, A., Hutter, F., and Lindauer, M. DeepCAVE: An interactive analysis tool for automated machine learning. CoRR, abs/2206.03493, 2022.

Schede, E., Brandt, J., Tornede, A., Wever, M., Bengs, V., Hüllermeier, E., and Tierney, K. A survey of methods for automated algorithm configuration. J. Artif. Intell. Res., 75:425–487, 2022.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG], 2017.

Shala, G., Arango, S., Biedenkapp, A., Hutter, F., and Grabocka, J. AutoRL-Bench 1.0. In Workshop on Meta-Learning (MetaLearn@NeurIPS'22), 2022.

Storn, R. and Price, K. Differential evolution - A simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim., 11(4):341–359, 1997.

Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., and Guyon, I. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. CoRR, abs/2104.10201, 2021.

Wan, X., Lu, C., Parker-Holder, J., Ball, P., Nguyen, V., Ru, B., and Osborne, M. Bayesian generational population-based training. In Guyon, I., Lindauer, M., van der Schaar, M., Hutter, F., and Garnett, R. (eds.), International Conference on Automated Machine Learning, AutoML, volume 188 of Proceedings of Machine Learning Research, pp. 14/1–27. PMLR, 2022.

Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Balcan, M. and Weinberger, K. (eds.), Proceedings of the 33rd International Conference on Machine Learning (ICML'17), volume 48, pp. 1995–2003. Proceedings of Machine Learning Research, 2016.

Xu, Z., van Hasselt, H., and Silver, D. Meta-gradient reinforcement learning. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS, pp. 2402–2413, 2018.

Xu, Z., van Hasselt, H., Hessel, M., Oh, J., Singh, S., and Silver, D. Meta-gradient reinforcement learning with an objective discovered online. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.

Yadan, O. Hydra - a framework for elegantly configuring complex applications. GitHub, 2019. URL https://github.com/facebookresearch/hydra.

Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H., Silver, D., and Singh, S. A self-tuning actor-critic algorithm. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS,
2020.

Zhang, B., Rajan, R., Pineda, L., Lambert, N., Biedenkapp,


A., Chua, K., Hutter, F., and Calandra, R. On the impor-
tance of hyperparameter optimization for model-based
reinforcement learning. In Banerjee, A. and Fukumizu,
K. (eds.), The 24th International Conference on Artificial
Intelligence and Statistics, AISTATS, volume 130 of Pro-
ceedings of Machine Learning Research, pp. 4015–4023.
PMLR, 2021a.
Zhang, S. and Jiang, N. Towards hyperparameter-free pol-
icy selection for offline reinforcement learning. In Ran-
zato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P.,
and Vaughan, J. W. (eds.), Advances in Neural Infor-
mation Processing Systems 34: Annual Conference on
Neural Information Processing Systems 2021, NeurIPS,
pp. 12864–12875, 2021.
Zhang, T., Xu, H., Wang, X., Wu, Y., Keutzer, K., Gonzalez,
J., and Tian, Y. Noveld: A simple yet effective exploration
criterion. In Ranzato, M., Beygelzimer, A., Dauphin,
Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances
in Neural Information Processing Systems 34: Annual
Conference on Neural Information Processing Systems
2021, NeurIPS, pp. 25217–25230, 2021b.

Zimmer, L., Lindauer, M., and Hutter, F. Auto-pytorch:


Multi-fidelity metalearning for efficient and robust au-
todl. IEEE Trans. Pattern Anal. Mach. Intell., 43(9):
3079–3090, 2021.


A. Reproducibility Checklist for Tuning Hyperparameters in RL


Below is a checklist we recommend for conducting experiments and reporting the process in RL. It is hard to give general recommendations for all RL settings when it comes to questions of budget, number of seeds or configuration space size. For guidance on an appropriate number of testing seeds, as well as recommendations on how to report them, see Agarwal et al. (2021). The ideal number of tuning seeds will likely depend heavily on the domain, though we recommend using at least 5 to avoid overtuning on a small number of seeds (a short code sketch of separating tuning and test seeds follows the checklist). As for configuration space size, we have tuned up to 14 hyperparameters successfully in this paper and saw only small differences between spaces of 3 and up to 9 hyperparameters in Section 4, so we see no reason to be too selective for search spaces of around this size unless hyperparameter importances for the algorithm and domain are already well known. Much larger search spaces could benefit from pruning, potentially after an initial analysis of hyperparameter importance.

1. Are there training and test settings available on your chosen domains?
If yes:
• Is only the training setting used for training? ✓✗
• Is only the training setting used for tuning? ✓✗
• Are final results reported on the test setting? ✓✗
2. Hyperparameters were tuned using <package-name> which is based on <an-optimization-method>
3. The configuration space was: <algorithm-1>:
• <a-continuous-hyperparameter>: (<lower>, <upper>)
• <a-logspaced-continuous-hyperparameter>: log((<lower>, <upper>))
• <a-discrete-hyperparameter>: [<lower>, <upper>]
• <a-categorical-hyperparameter>: <choice-a>, <choice-b>
• ...
<algorithm-2>:
• <an-additional-hyperparameter>: (<lower>, <upper>)
• ...
4. The search space contains the same hyperparameters and search ranges wherever algorithms share hyperparameters ✓✗
If no, why not?
5. The cost metric(s) optimized was/were <a-cost-metric>
6. The tuning budget was <the-budget>
7. The tuning budget was the same for all tuned methods ✓✗
If no, why not?
8. If the budget is given in time: the hardware used for all tuning runs is comparable ✓✗
9. All reported methods were tuned with the methods and settings described above ✓✗
If no, why not?
10. Tuning was done across <n> tuning seeds which were: [<0>, <1>, <2>, <3>, <4>]
11. Testing was done across <m> test seeds which were: [<5>, <6>, <7>, <8>, <9>]
12. Are all results reported on the test seeds? ✓✗
If no, why not?
13. The final incumbent configurations reported were:
<algorithm-1-env-1>:
• <a-hyperparameter>: <value>
• ...
<algorithm-1-env-2>:
• <a-hyperparameter>: <value>


• ...

<algorithm-2-env-1>:

• <a-hyperparameter>: <value>
• ...

14. The code for reproducing these experiments is available at: <a-link>
15. The code also includes the tuning process ✓✗
16. Bundled with the code is an exact version of the original software environment, e.g. a conda environment file with all
package versions or a docker image in case some dependencies are not conda installable ✓✗
17. The following hardware was used in running the experiments:

• <n> <gpu-types>
• ...
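To make items 10–12 above concrete, the following minimal sketch (in Python) shows one way of keeping tuning and test seeds separate; the seed lists, the train_and_evaluate placeholder and the aggregation are illustrative assumptions, not part of any specific tuning package. The important property is that the cost handed to the tuner never touches the test seeds, which are only used once the incumbent is fixed.

```python
import numpy as np

TUNING_SEEDS = [0, 1, 2, 3, 4]  # used only inside the HPO loop
TEST_SEEDS = [5, 6, 7, 8, 9]    # used only for the final report


def train_and_evaluate(config: dict, seed: int) -> float:
    # Placeholder for a full RL training + evaluation run returning the final return.
    rng = np.random.default_rng(seed)
    return 100.0 + rng.normal()


def tuning_cost(config: dict) -> float:
    # Objective handed to the tuner: mean cost over the tuning seeds only.
    returns = [train_and_evaluate(config, seed=s) for s in TUNING_SEEDS]
    return -float(np.mean(returns))  # tuners minimize, so we negate the return


def report(incumbent: dict) -> None:
    # The incumbent is evaluated exactly once on the held-out test seeds.
    test_returns = [train_and_evaluate(incumbent, seed=s) for s in TEST_SEEDS]
    print(f"Test return: {np.mean(test_returns):.1f} +/- {np.std(test_returns):.1f}")
```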

B. Our AutoRL Hydra Sweepers


We provide implementations of DEHB and PBT variations to supplement existing options like Optuna (Akiba et al., 2019) or ray (Liaw et al., 2018) with state-of-the-art HPO tools that otherwise still have relatively high barriers to adoption, particularly in RL. Our goal was to provide a tuning option with as little human overhead as possible and as much flexibility as possible for applying it to different RL codebases. Hydra allows us to do both by using the tuning algorithms as sweepers that launch different configurations either locally or as parallel cluster jobs. In practice, this means only minimal code changes are necessary to use our sweepers: the training function's return value needs to be a cost metric, and for the PBT variants, checkpointing and loading of agents is mandatory. In this way, the plugins are compatible with any RL algorithm and environment.
Once these changes are implemented, a Hydra sweeper configuration that includes a search space definition can be used to run the whole optimization process in one go or to resume existing runs (e.g. if the optimization was terminated accidentally or if more tuning budget becomes available after the fact). We include the option of using tuning seeds, which is so far uncommon except for CleanRL (Huang et al., 2021), where they are user specified. Furthermore, we extended the option of warmstarting with initial runs from BGT to the original PBT and PB2 in order to stabilize those methods. In comparison to existing user-friendly tuners like Optuna, we provide tuning algorithms that are not BO-based and include the option of multi-fidelity tuning directly in Hydra instead of requiring a separate script.
Figure 7 shows an example Hydra configuration file turned into a ready-to-run tuning configuration for DEHB. The corresponding search space definition, here for the full PPO search space in StableBaselines3, is shown in Figure 8.
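As a textual complement to the figures, the sketch below shows the kind of change described above: a Hydra entry point that returns a scalar cost the sweeper can minimize. The config path, config keys and the run_training stand-in are assumptions for illustration only.

```python
import hydra
from omegaconf import DictConfig


def run_training(lr: float, ent_coef: float, seed: int) -> float:
    # Dummy stand-in so the sketch is self-contained; replace with a real RL run.
    return 100.0 - 1e3 * abs(lr - 3e-4) - 10.0 * abs(ent_coef - 0.01)


@hydra.main(config_path="configs", config_name="ppo_base", version_base=None)
def train(cfg: DictConfig) -> float:
    # The sweeper calls this entry point with different hyperparameter settings.
    eval_return = run_training(
        lr=cfg.algorithm.learning_rate,  # assumed config layout
        ent_coef=cfg.algorithm.ent_coef,
        seed=cfg.seed,
    )
    return -eval_return  # the sweepers minimize, so we return a cost


if __name__ == "__main__":
    train()
```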

C. Additional Background on Tuning Methods Used


Since many in the RL community might be unfamiliar with the state of the art in AutoML and AutoRL, we provide brief
descriptions of the RS, DEHB and PBT approaches we use in this paper.

C.1. Random Search


Random Search (RS) for hyperparameter optimization commonly refers to sampling configurations from the configuration space in a pseudo-random fashion (Bergstra & Bengio, 2012). Each sampled configuration is evaluated with a full algorithm run and the best performing one is selected as the incumbent. While RS is not as reliable as other tuning options on small budgets and large search spaces, it has proven to be a better alternative to grid search due to its scaling properties. Since Grid Search exhaustively evaluates all combinations of the given hyperparameter values, it needs n^m algorithm evaluations, with n being the number of values per dimension and m the number of dimensions in the search space. Even then, it only covers n distinct values per dimension, irrespective of how important that dimension actually is. RS implicitly shifts more of the budget towards the relevant hyperparameters by varying the whole configuration at once, which produces good results on smaller budgets. Furthermore, Grid Search relies entirely on the domain knowledge of the user, since the user provides all candidate values. This is an issue for new methods or domains, or whenever the optimal hyperparameter configuration falls outside of the usual ranges.
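For illustration, the loop below sketches random search over a small PPO-style space; the hyperparameter names, ranges and toy objective are only examples. A grid with ten values for each of these three hyperparameters would already require 10^3 full runs.

```python
import math
import random


def sample_config(rng: random.Random) -> dict:
    # Log-uniform for the learning rate, uniform / choice for the others.
    return {
        "learning_rate": 10 ** rng.uniform(math.log10(1e-6), math.log10(1e-1)),
        "ent_coef": rng.uniform(0.0, 0.5),
        "batch_size": rng.choice([16, 32, 64, 128]),
    }


def random_search(evaluate, budget: int = 20, seed: int = 0):
    rng = random.Random(seed)
    incumbent, best_cost = None, float("inf")
    for _ in range(budget):
        config = sample_config(rng)
        cost = evaluate(config)  # one full training run per configuration
        if cost < best_cost:
            incumbent, best_cost = config, cost
    return incumbent, best_cost


# Example usage with a toy objective standing in for an RL training run:
incumbent, cost = random_search(lambda c: abs(math.log10(c["learning_rate"]) + 3.5))
```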


Figure 7: A base Hydra configuration file (left) and the changes necessary to tune this algorithm with DEHB (right).

Figure 8: Example definition of a search space for PPO in a separate configuration file.


C.2. DEHB
DEHB is the combination of the evolutionary algorithm Differential Evolution (DE) (Storn & Price, 1997) and the multi-fidelity method HyperBand (Li et al., 2018). HyperBand is based on the idea of running many configurations on a small budget, i.e. only a fraction of the training steps, and progressing promising ones to the next higher budget level. In this way we see many datapoints but avoid spending time on bad configurations. DEHB starts with a full set of HyperBand budgets, from very low to the full budget, and runs it in its first iteration. For each budget, DEHB runs the equivalent of one full algorithm run in steps, e.g. if the current budget is 1/10 of the full run budget, 10 configurations will be evaluated. In the second iteration, the lowest budget is left out and the now-lowest budget is initialised with a population of configurations evolved by DEHB from the previous iteration's results. This procedure continues until either a maximum number of iterations is reached or only the full run budget is left. The number of budget levels is determined by a hyperparameter η.
In our experiments in Section 4 we run 3 iterations with η = 5, i.e. only 3 budgets, and in our larger DEHB experiments in Section 5 we use 2 iterations with η = 1.9, i.e. 8 budgets. We set the minimum budget to 1/100 of the full run training steps in each case.
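To make the budget structure concrete, the helper below computes geometrically spaced HyperBand budget levels from η and the minimum/maximum budget; it mirrors the settings above but is only an illustrative sketch, not the DEHB implementation.

```python
import math


def hyperband_budgets(min_budget: float, max_budget: float, eta: float) -> list:
    """Geometrically spaced budget levels between min_budget and max_budget."""
    num_levels = int(math.floor(math.log(max_budget / min_budget, eta))) + 1
    return [max_budget * eta ** -i for i in reversed(range(num_levels))]


# Section 4 setting: eta = 5 with a minimum budget of 1/100 of a full run
print(hyperband_budgets(min_budget=0.01, max_budget=1.0, eta=5))    # 3 budget levels
# Section 5 setting: eta = 1.9 yields 8 budget levels instead
print(hyperband_budgets(min_budget=0.01, max_budget=1.0, eta=1.9))  # 8 budget levels
```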

C.3. PBT Variants


PBT is based on the idea of maintaining a population of agents in parallel, each with its own hyperparameter configuration. These agents are trained for n steps, after which their performance is evaluated and a checkpoint of their training state is created. A portion of the worst agents is then replaced by the best ones and the configurations are refined for the next iteration. This results in a hyperparameter schedule that uses the best performing configurations at each iteration.
The original PBT (Li et al., 2019) randomly samples the initial configurations and subsequently perturbs them by randomly increasing or decreasing each hyperparameter by a constant factor. Categorical values are randomly resampled with a fixed probability. This undirected exploration proved successful, but only with large population sizes upwards of 64 agents. Newer PBT variants therefore often use a model to select new hyperparameter configurations; PB2, for example, uses Bayesian Optimization with a Gaussian Process (Parker-Holder et al., 2020). This enables optimization with a significantly smaller population size of as few as 4 agents.
As we have seen, however, the results can be volatile. Wan et al. (2022) therefore proposed two main extensions on top of PB2, in addition to the ability to tune the architecture within the PBT framework, forming the current state of the art among PBT methods: (1) periodic kernel restarts in case no improvements are observed and (2) full-budget initial runs to warmstart the Gaussian Process with high-quality datapoints.
In our experiments we use 20 configuration changes for each method, with a population size of 8 for PB2 in Section 4 and 16/64 for PB2 in Section 5. For BGT, we use 8 initial runs and a population size of 8 for the smaller budget, and 48 and 16, respectively, for the larger one. In both PB2 and BGT, we always replace the worst 12.5% of agents with the best 12.5%.
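The snippet below sketches the exploit-and-explore step that all of these PBT variants share: truncation selection of the worst agents plus a perturbation of the inherited hyperparameters (as in the original PBT; PB2 and BGT instead suggest new values with a Gaussian Process). The population layout and perturbation factors are illustrative assumptions.

```python
import copy
import random


def pbt_step(population, perturb_factors=(0.8, 1.2), replace_fraction=0.125, rng=None):
    """One exploit/explore step: copy the best agents into the worst and perturb."""
    rng = rng or random.Random(0)
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    n_replace = max(1, int(len(ranked) * replace_fraction))
    best, worst = ranked[:n_replace], ranked[-n_replace:]

    for loser, winner in zip(worst, best):
        # Exploit: inherit weights (checkpoint) and hyperparameters of a top agent.
        loser["checkpoint"] = copy.deepcopy(winner["checkpoint"])
        loser["hyperparameters"] = dict(winner["hyperparameters"])
        # Explore: perturb each numerical hyperparameter by a random factor.
        for name, value in loser["hyperparameters"].items():
            loser["hyperparameters"][name] = value * rng.choice(perturb_factors)
    return population


# Toy population of 8 members with a single hyperparameter:
pop = [{"score": i, "checkpoint": {}, "hyperparameters": {"lr": 3e-4}} for i in range(8)]
pbt_step(pop)
```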

D. An Overview of Hyperparameter Configurations & Search Spaces


D.1. Stable Baselines Default Configurations
Table 3 shows the default hyperparameters we use throughout Section 4.

D.2. Stable Baseline Sweep Values


We sweep over the same hyperparameter values for each environment, one dimension at a time.
For PPO, these are learning rate ∈ {1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6, 5e-7}, entropy coefficient ∈ {0.1, 0.05, 0.01, 0.005, 0.001} and clip range ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}.
For SAC, learning rate ∈ {1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6, 5e-7}, tau ∈ {1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1} and training frequency ∈ {1, 2, 4, 8, 16}.
For DQN, learning rate ∈ {1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6, 5e-7}, epsilon ∈ {1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1} and training frequency ∈ {1, 2, 4, 8, 16}.
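A sketch of this one-dimension-at-a-time protocol: each swept value overrides a single hyperparameter while all others stay at their defaults. The run callable, the default values and the seed list are placeholders.

```python
DEFAULTS = {"learning_rate": 1e-3, "ent_coef": 0.01, "clip_range": 0.2}  # placeholder defaults
SWEEPS = {
    "learning_rate": [1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6, 5e-7],
    "ent_coef": [0.1, 0.05, 0.01, 0.005, 0.001],
    "clip_range": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
}


def sweep(run, seeds=(0, 1, 2, 3, 4)) -> dict:
    results = {}
    for name, values in SWEEPS.items():
        for value in values:
            config = {**DEFAULTS, name: value}  # vary one dimension, keep the rest at defaults
            results[(name, value)] = [run(config, seed) for seed in seeds]
    return results
```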


Acrobot & Pendulum | Brax | MiniGrid

PPO
Policy Class            MlpPolicy | MlpPolicy | CnnPolicy
learning rate           1e-3 | 1e-4 | 7e-4
batch size              64 | 512 | 64
gamma                   0.9 | 0.99 | 0.999
n steps                 1024 | 1024 | 256
n epochs                10 | 16 | 4
gae lambda              0.95 | 0.96 | 0.95
clip range              0.2 | 0.2 | 0.2
clip range vf           null | null | null
normalize advantage     True | True | True
ent coef                0.01 | 0.01 | 0.01
vf coef                 0.5 | 0.5 | 0.5
max grad norm           0.5 | 0.5 | 0.5
use sde                 False | False | False
sde sample freq         4 | 4 | 4

SAC
Policy Class            MlpPolicy | MlpPolicy
learning rate           1e-4 | 1e-4
batch size              256 | 512
gamma                   0.99 | 0.99
tau                     1.0 | 1.0
learning starts         100 | 100
buffer size             1000000 | 1000000
train freq              1 | 1
gradient steps          1 | 1
use sde                 False | False
sde sample freq         -1 | -1

DQN
Policy Class            MlpPolicy | CnnPolicy
learning rate           1e-3 | 5e-7
batch size              64 | 64
tau                     1.0 | 1.0
gamma                   0.9 | 0.999
learning starts         50000 | 100
train freq              4 | 4
gradient steps          1 | 1
exploration fraction    0.1 | 0.1
exploration initial eps 1.0 | 1.0
exploration final eps   0.05 | 0.05
buffer size             1000000 | 1000000

Table 3: StableBaselines hyperparameter defaults for different environments.


Hyperparameter | Full Space | Small Space | LR Only

PPO
learning rate           log(interval(1e-6, 0.1)) | log(interval(1e-6, 0.1)) | log(interval(1e-6, 0.1))
ent coef                interval(0.0, 0.5) | interval(0.0, 0.5)
n epochs                range[5,20] | range[5,20]
batch size              {16, 32, 64, 128}
n steps                 {256, 512, 1024, 2048, 4096}
gae lambda              interval(0.8, 0.9999)
clip range              interval(0.0, 0.5)
clip range vf           interval(0.0, 0.5)
normalize advantage     {True, False}
vf coef                 interval(0.0, 1.0)
max grad norm           interval(0.0, 1.0)

SAC
learning rate           log(interval(1e-6, 0.1)) | log(interval(1e-6, 0.1)) | log(interval(1e-6, 0.1))
train freq              range[1,1e3] | range[1,1e3]
tau                     interval(0.01, 1.0) | interval(0.01, 1.0)
batch size              {64, 128, 256, 512}
learning starts         range[0,1e4]
buffer size             range[5e3,5e7]
gradient steps          range[1,10]

DQN
learning rate           log(interval(1e-6, 0.1)) | log(interval(1e-6, 0.1)) | log(interval(1e-6, 0.1))
batch size              {4, 8, 16, 32} | {4, 8, 16, 32}
exploration fraction    interval(0.005, 0.5) | interval(0.005, 0.5)
learning starts         range[0,1e4]
train freq              range[1,1e3]
gradient steps          range[1,10]
exploration initial eps interval(0.5, 1.0)
exploration final eps   interval(0.001, 0.2)
buffer size             range[5e3,5e7]

Table 4: StableBaselines search spaces.

D.3. Stable Baselines Search Spaces


Table 4 shows the search spaces we used for the experiments in Section 4. The search spaces are the same for all tuning
methods and across environments. We denote floating point intervals as interval(lower,upper), integer ranges as
range[lower,upper], categorical choices as {choice 1, choice 2} and add log if the search space for this
hyperparameter is traversed logarithmically.
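For readers who prefer code over this notation, the same kind of space can be written down with the ConfigSpace library, for example. The sketch below covers a few of the PPO hyperparameters from Table 4; our experiments define the spaces in Hydra configuration files instead.

```python
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
    UniformIntegerHyperparameter,
)

# A few PPO hyperparameters from Table 4: log-spaced learning rate,
# entropy coefficient, number of epochs and batch size.
cs = ConfigurationSpace(seed=0)
cs.add_hyperparameters([
    UniformFloatHyperparameter("learning_rate", lower=1e-6, upper=0.1, log=True),
    UniformFloatHyperparameter("ent_coef", lower=0.0, upper=0.5),
    UniformIntegerHyperparameter("n_epochs", lower=5, upper=20),
    CategoricalHyperparameter("batch_size", choices=[16, 32, 64, 128]),
])

print(cs.sample_configuration())
```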

D.4. Brax Experiment Settings


We base our implementations on the training code provided with Brax, with minor additions such as performing only a single final evaluation and loading agents. The GitHub commit ID for the code version we use is 3843d433050a08cb492c301e039e04409b3557fc. The cost metric we optimize is the evaluation reward across one episode of the environment batch. We tune on seeds 0-4 and evaluate on seeds 5-14. The baseline hyperparameters are taken from this commit as well and are shown in Table 5 together with our search space.
For both Procgen and Brax, we compute the rank of a method as follows: the best performing method on the test seeds, and all other methods within its standard deviation, receive rank one. The method with the next best mean (and all methods within its standard deviation) receives the next free rank – 2 in case there was a single best method, 3 if there were two, and so on. These ranks are determined per environment, from which we compute a mean rank across the whole domain.
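A sketch of this ranking rule, operating on per-environment means and standard deviations (the helper and the toy numbers are illustrative):

```python
def rank_methods(means: dict, stds: dict) -> dict:
    """Ranks where every method within one std of the current best shares its rank."""
    ranks, next_rank = {}, 1
    remaining = dict(means)
    while remaining:
        best = max(remaining, key=remaining.get)
        # All still-unranked methods within the best method's std share this rank.
        tied = [m for m in remaining if remaining[m] >= remaining[best] - stds[best]]
        for m in tied:
            ranks[m] = next_rank
            del remaining[m]
        next_rank += len(tied)  # next free rank: 2 if one method was best, 3 if two, ...
    return ranks


# Toy example: A and B tie within one std and share rank 1, C receives rank 3.
print(rank_methods({"A": 10.0, "B": 9.5, "C": 7.0},
                   {"A": 1.0, "B": 0.8, "C": 0.5}))
```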


Hyperparameter Search Space Defaults


learning rate log(interval(1e-6, 0.1)) 3e-4
num update epochs range[1,15] 4
batch size {128,256,512,1024,2048} 1024
num minibatches range[0,7] 6
entropy cost interval(0.0001, 0.5) 1e-2
gae lambda interval(0.5, 0.9999) 0.95
epsilon interval(0.01, 0.9) 0.3
vf coef interval(0.01, 0.9) 0.5
reward scaling interval(0.01, 1.0) 0.1
Table 5: Search space and baseline hyperparameters for Brax. The actual number of minibatches is 2^(num minibatches). Epsilon refers to the clip range in this implementation.

Hyperparameter Search Space Bigfish Climber Plunder


lr log(interval(1e-6, 0.1)) 5e-4 5e-4 5e-4
eps log(interval(1e-6, 0.1)) 1e-5 1e-5 1e-5
hidden size - 256 256 256
clip param interval(0.0, 0.5) 0.2 0.2 0.2
num mini batch - 8 8 8
ppo epoch range[1, 5] 3 3 3
num steps - 256 256 256
max grad norm interval(0.0, 1.0) 0.5 0.5 0.5
value loss coef interval(0.0, 1.0) 0.5 0.5 0.5
entropy coef interval(0.0, 0.5) 0.01 0.01 0.01
gae lambda interval(0.8, 0.9999) 0.95 0.95 0.95
gamma - 0.999 0.999 0.999
alpha interval(0.8, 0.9999) 0.99 0.99 0.99
clf hidden size - 4 64 4
order loss coef interval(0.0, 0.1) 0.01 0.001 0.1
use nonlinear clf {True, False} True True False
adv loss coef interval(0.0, 1.0) 0.05 0.25 0.3
value freq range[1, 5] 32 1 8
value epoch range[1, 10] 9 9 1
Table 6: Search space and baseline hyperparameters for IDAAC on Procgen.

D.5. Procgen Experiment Settings


We use the open-source code provided by Raileanu & Fergus (2021) with minor additions such as loading agents. The GitHub commit ID for the code version we use is 2fe30202942898b1b09d76e5d8c71d5a7db3686b. The cost metric we optimize is the evaluation reward across ten episodes of the training environment. We tune on seeds 0-4 and evaluate on seeds 5-14. Our baseline is given by the provided best hyperparameters per environment (see Table 6 for this configuration and our search space).

D.6. Hardware
All of our experiments were run on a compute cluster with two Intel CPUs per node (these were used for the experiments in
Section 4) and four different node configurations for GPU (used for the experiments in Section 5). These configurations
are: 2 Pascal 130 GPUs, 2 Pascal 144 GPUs, 8 Volta 10 GPUs or 8 Volta 332. We ran the CPU experiments with 10GB of
memory on single nodes and the GPU experiments with 10GB for Procgen and 40GB for Brax on a single GPU each.


Hyperparameter | Default | DEHB | PB2 | RS

PPO Acrobot
learning rate           1e-3 | 5e-05 | 5e-3 for 90e5 steps, then 3e-6 | 1e-6
batch size              64 | 64 | 64 for 90e5 steps, then 128 | 64
n steps                 1024 | 1024 | 256 | 256
n epochs                10 | 14 | 8 for 90e5 steps, then 20 | 6
gae lambda              0.95 | 0.82 | 0.85 for 90e5 steps, then 0.82 | 0.81
clip range              0.2 | 0.14 | 0.06 for 90e5 steps, then 0.27 | 0.05
clip range vf           null | 0.43 | 0.06 | 0.11
normalize advantage     True True True
ent coef                0.01 | 0.21 | 0.42 for 90e5 steps, then 0.29 | 0.01
vf coef                 0.5 | 0.02 | 0.5 for 90e5 steps, then 0.7 | 0.07
max grad norm           0.5 | 0.75 | 0.24 for 90e5 steps, then 0.5 | 0.97

Table 7: Default configuration and tuned incumbents of DEHB, PB2 (as a schedule) and RS for PPO on Acrobot on the full search space.

E. Details on the Tuned Configurations


We want to give some insight into how much the incumbents found by our HPO methods differ from the baselines and from one another. Table 7 shows an example comparison between different incumbent configurations on the full search space of PPO on Acrobot. This result is consistent with what we find across other algorithms and environments: the differences between incumbents, as well as between incumbents and the baseline, are fairly large. HPO in our experiments did not result in small changes to only a subset of the search space, but usually in significant deviations from the baseline in most dimensions. Still, we can see some similarities at times; in this case the batch size stays consistently at 64 across all configurations (with the exception of the final training phase of PB2). We can also see common trends among the incumbents, e.g. in the value of the GAE λ, which lies between 0.81 and 0.85 for the incumbents but at 0.95 in the default configuration. The other hyperparameter values, however, do not seem to share any trends and often differ significantly, e.g. the entropy coefficient, which varies between 0.01 and 0.42.
Why, then, do we see such similar performance from all of these configurations? We believe three main factors are at play: hyperparameter importance, the algorithm's sensitivity to a hyperparameter's value, and interaction effects between hyperparameters. It is likely that not all hyperparameters are crucial to optimize in this setting, so very different values for unimportant hyperparameters can make the configurations appear more different than they effectively are. We know from our experiments, however, that a mistake in the entropy coefficient can be highly damaging to the algorithm's performance on Acrobot (see Figure 21 below). Comparing the entropy coefficient curves for Acrobot (Figure 21) and Ant (Figure 22) reveals that the median performance across different entropy coefficients degrades much less quickly for Acrobot than for Ant – on Acrobot, PPO is less sensitive to changes in hyperparameter values. To put it another way: hyperparameter values may look different between configurations but result in the same algorithm behaviour as long as they stay within a similar range. Lastly, since we optimize many hyperparameters and the agent's behaviour depends on all of them jointly, hyperparameters may interact with each other to produce a similar outcome as long as their relation stays similar. This could explain combining lower learning rates with more update epochs, as DEHB does. Analysing hyperparameter configurations on their own, however, does not provide enough information to determine each of these factors for each hyperparameter. They would have to be explored through separate experiments before we can draw conclusions on how similar the configurations found by the HPO tools really are and what that means for optimal hyperparameter values in our settings.

F. Tuning Results on Brax & Procgen in Tabular Form


For ease of comparison, we provide the results of Section 5 in tabular form.


Table 8: Tuning PPO on Brax's Ant, Halfcheetah and Humanoid environments. Shown are results across 3 tuning runs using 5 seeds each, tested on 10 different test seeds.

Ant Halfcheetah Humanoid


Baseline 3448 ± 343 6904 ± 377 3235 ± 758
DEHB Inc. 5745 ± 878 3993 ± 871 1788 ± 718
DEHB Test 4288 ± 1017 4928 ± 500 3167 ± 874
RS Inc. 2515 ± 1750 2978 ± 1007 763 ± 317
RS Test 5165 ± 896 3646 ± 699 753 ± 209
DEHB Inc. (64) 7170 ± 1045 8202 ± 445 4338 ± 1655
DEHB Test (64) 4696 ± 1252 8039 ± 636 5205 ± 2781
BGT Inc. (64) 1119 ± 1321 1051 ± 752 434 ± 15
BGT Test (64) 3196 ± 3307 456 ± 461 132 ± 0
RS Inc. (64) 6344 ± 654 7891 ± 386 2932 ± 798
RS Test (64) 669 ± 2447 950 ± 1461 325 ± 162

Table 9: Tuning IDAAC on Procgen’s Bigfish, Climber and Plunder. Results are across 3 runs using 5 seeds each, and tested on 10
different test seeds.

Bigfish Climber Plunder


Baseline 6.8 ± 3.2 4.1 ± 1.4 11.8 ± 5.5
DEHB Inc. 7.3 ± 2.0 3.7 ± 0.2 5.8 ± 0.2
DEHB Test 11.9 ± 4.3 2.7 ± 1.5 8.6 ± 2.6
BGT Inc. 1.3 ± 0.2 2.5 ± 0.4 4.5 ± 0.3
BGT Test 0.9 ± 0.4 2.4 ± 1.1 5.3 ± 1.0
PB2 Inc. 26.1 ± 2.2 3.2 ± 0.3 4.7 ± 0.3
PB2 Test 3.4 ± 1.9 2.6 ± 1.1 8.3 ± 2.7
RS Inc. 4.4 ± 2.1 5.5 ± 0.6 6.2 ± 1.3
RS Test 7.4 ± 5.0 2.6 ± 1.0 4.7 ± 1.0
DEHB Inc. (64 runs) 11.5 ± 0.7 6.0 ± 1.0 7.3 ± 0.9
DEHB Test (64 runs) 9.4 ± 2.5 3.9 ± 1.9 8.7 ± 0.7
BGT Inc. (64 runs) 4.7 ± 5.1 3.2 ± 1.5 5.3 ± 0.1
BGT Test (64 runs) 2.1 ± 1.9 2.6 ± 0.4 5.9 ± 1.6
PB2 Inc. (64 runs) 10.5 ± 10.1 3.4 ± 0.3 5.2 ± 0.2
PB2 Test (64 runs) 2.1 ± 1.1 3.0 ± 0.9 4.3 ± 0.2
RS Inc. (64 runs) 1.9 ± 0.3 5.9 ± 0.8 6.7 ± 0.4
RS Test (64 runs) 1.1 ± 0.1 2.3 ± 0.7 3.6 ± 0.5


G. Hyperparameter Sweeps for PPO, DQN and SAC


The full set of PPO sweeps can be found in Figures 9, 10, 11 and 12, the SAC sweeps in Figures 15 and 16, and the DQN sweeps in Figures 13 and 14.

G.1. PPO Sweeps

Figure 9: Hyperparameter Sweeps for PPO on Pendulum and Acrobot.


Figure 10: Hyperparameter Sweeps for PPO on Ant and Halfcheetah.


Figure 11: Hyperparameter Sweeps for PPO on Humanoid.


Figure 12: Hyperparameter Sweeps for PPO on MiniGrid.


G.2. DQN Sweeps

Figure 13: Hyperparameter Sweeps for DQN on Acrobot.


Figure 14: Hyperparameter Sweeps for DQN on MiniGrid.


G.3. SAC Sweeps

Figure 15: Hyperparameter Sweeps for SAC on Pendulum and Ant.


Figure 16: Hyperparameter Sweeps for SAC on Halfcheetah and Humanoid.


H. Full Performance Pointplots


The full set of SAC pointplots can be found in Figures 17 and 18, the DQN pointplots in Figures 19 and 20 and the PPO
pointplots in Figures 21, 22, 23 and 24.

H.1. SAC Pointplots

Figure 17: Final returns across 5 seeds for different hp variations of SAC on Pendulum and Ant.


Figure 18: Final returns across 5 seeds for different hp variations of SAC of Halfcheetah and Humanoid.


H.2. DQN Pointplots

Figure 19: Final returns across 5 seeds for different hp variations of DQN on Acrobot.


Figure 20: Final returns across 5 seeds for different hp variations of DQN on MiniGrid.


H.3. PPO Pointplots

Figure 21: Final returns across 5 seeds for different hp variations of PPO on Pendulum and Acrobot.


Figure 22: Final returns across 5 seeds for different hp variations of PPO on Ant and Halfcheetah.


Figure 23: Final returns across 5 seeds for different hp variations of PPO on Humanoid.


Figure 24: Final returns across 5 seeds for different hp variations of PPO on MiniGrid.


I. Hyperparameter Importances using fANOVA


These hyperparameter importance plots were made using the fANOVA (Hutter et al., 2014) plugin of DeepCAVE (Sass
et al., 2022).

Figure 25: PPO Hyperparameter Importances on Acrobot (left) and Pendulum (right).

Figure 26: PPO Hyperparameter Importances on MiniGrid Empty (left) and MiniGrid DoorKey (right).


Figure 27: PPO Hyperparameter Importances on Brax Ant (left), Halfcheetah (middle) and Humanoid (right).

Figure 28: DQN Hyperparameter Importances on Acrobot (left), MiniGrid Empty (middle) and MiniGrid DoorKey (right).

Figure 29: SAC Hyperparameter Importances on Pendulum (left) and Brax Ant (right).

Figure 30: SAC Hyperparameter Importances on Brax Halfcheetah (left) and Humanoid (right).


J. Partial Dependency Plots


These plots show performance (lighter is better) across the value ranges of two hyperparameters.

J.1. SAC on Pendulum


J.2. PPO on Acrobot
