Hyperparameters in RL and How To Tune Them
Abstract
In order to improve reproducibility, deep reinforcement learning (RL) has been adopting better scientific practices such as standardized evaluation
Hambro et al., 2022). We show that even often overlooked hyperparameters can make or break an algorithm's success, meaning that careful consideration is necessary for a broad range of hyperparameters. This is especially important for as-of-yet unexplored domains, as pointed out by Zhang et al. (2021a). Furthermore, hyperparameters cause different algorithm behaviors depending on the random seed, which is a well-known fact in AutoML (Eggensperger et al., 2019; Lindauer & Hutter, 2020) but has not yet factored widely into RL research, negatively impacting reproducibility.

Fractured State of the Art in AutoRL Even though HPO approaches have succeeded in tuning RL algorithms (Franke et al., 2021; Awad et al., 2021; Zhang et al., 2021a; Wan et al., 2022), the costs and benefits of HPO are relatively unknown in the community. AutoRL papers often compare only a few HPO methods, are limited to single domains or toy problems, or use a single RL algorithm (Jaderberg et al., 2017; Parker-Holder et al., 2020; Awad et al., 2021; Kiran & Ozyildirim, 2022; Wan et al., 2022). In this work, we aim to understand the need for and challenges of AutoRL by comparing multiple HPO methods across various state-of-the-art RL algorithms on challenging environments. Our results demonstrate that HPO approaches have better performance and less compute overhead than the hyperparameter sweeps or grid searches typically used in the RL community (see Figure 1).

Ease of Use State-of-the-art AutoML tools are often released as research papers rather than standalone packages. In addition, they are not immediately compatible with standard RL code, while easy-to-use solutions like Optuna (Akiba et al., 2019) or Ax (Bakshy et al., 2018) only provide a limited selection of HPO approaches. To improve the availability of these tools, we provide Hydra sweepers (Yadan, 2019) for several variations of population-based methods, such as standard PBT (Jaderberg et al., 2017), PB2 (Parker-Holder et al., 2020) and BGT (Wan et al., 2022), as well as the evolutionary algorithm DEHB (Awad et al., 2021). Note that all of these have been shown to improve over random search for tuning RL agents. As black-box methods, they are compatible with any RL algorithm or environment, and due to Hydra, users do not have to change their implementation beyond returning a success metric like the reward once training is finished. Based on our empirical insights, we provide best practice guidelines on how to use HPO for RL.

In this paper, we demonstrate that compared to tuning hyperparameters by hand, existing HPO tools are capable of producing better performing, more stable, and more easily comparable RL agents, while using fewer computational resources. We believe widespread adoption of HPO protocols within the RL community will therefore result in more accurate and fair comparisons across RL methods and, in the end, faster progress.

To summarize, our contributions are:

1. Exploration of the hyperparameter landscape for commonly-used RL algorithms and environments;
2. Comparison of different types of HPO methods on state-of-the-art RL algorithms and challenging RL environments;
3. Open-source implementations of advanced HPO methods that can easily be used with any RL algorithm and environment; and
4. Best practice recommendations for HPO in RL.

2. The Hyperparameter Optimization Problem

We provide an overview of the most relevant formalizations of HPO in RL, Algorithm Configuration (Schede et al., 2022) and Dynamic Algorithm Configuration (Adriaensen et al., 2022). Algorithm Configuration (AC) is a popular paradigm for optimizing hyperparameters of several different kinds of algorithms (Eggensperger et al., 2019).

Definition 2.1 (AC). Given an algorithm A, a hyperparameter space $\Lambda$, a distribution of environments or environment instances $I$, and a cost function $c$, find the optimal configuration $\lambda^* \in \Lambda$ across possible tasks s.t.:

$\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathbb{E}_{i \sim I}\left[c(A(i; \lambda))\right].$

The cost function could be the negative of the agent's reward or a failure indicator across a distribution of tasks. Thus it is quite flexible and can accommodate a diverse set of possible goals for algorithm performance. This definition is not restricted to one train and test setting but aims to achieve the best possible performance across a range of environments or environment instances. AC approaches thus strive to avoid overfitting the hyperparameters to a specific scenario. Even for RL problems focusing on generalization, AC is therefore a suitable framework. Commonly, the HPO process is terminated before we have found the true $\lambda^*$ via an optimization budget (e.g. the runtime or number of training steps). The best hyperparameter configuration found by the optimization process is called the incumbent.

Another relevant paradigm for tuning RL is Dynamic Algorithm Configuration (DAC) (Biedenkapp et al., 2020; Adriaensen et al., 2022). DAC is a generalization of AC that does not search for a single optimal hyperparameter value per algorithm run but instead for a sequence of values.

Definition 2.2 (DAC). Given an algorithm A, a hyperparameter space $\Lambda$, a distribution of environments or environment instances $I$ with state space $S$, a cost function $c$, and a space of dynamic configuration policies $\Pi$ with each $\pi \in \Pi : S \times I \rightarrow \Lambda$, find $\pi^* \in \Pi$ s.t.:

$\pi^* \in \arg\min_{\pi \in \Pi} \mathbb{E}_{i \sim I}\left[c(A(i; \pi))\right].$
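To make the AC objective in Definition 2.1 concrete, below is a minimal sketch (not taken from the paper) of how a black-box optimizer such as random search approximates it: the expectation over instances is replaced by an average over a sampled subset, and the best configuration found so far is kept as the incumbent. The callables `sample_configuration` and `cost` are hypothetical placeholders for a search-space sampler and a full training-plus-evaluation run of the target RL algorithm.

```python
import random

def random_search_ac(sample_configuration, cost, instances,
                     n_configs=20, n_instances_per_eval=5, seed=0):
    """Illustration of the AC objective: return the configuration with the lowest
    cost averaged over a sample of environment instances (or seeds)."""
    rng = random.Random(seed)
    incumbent, incumbent_cost = None, float("inf")
    for _ in range(n_configs):
        lam = sample_configuration(rng)                            # lambda sampled from the space
        evals = rng.sample(instances, min(n_instances_per_eval, len(instances)))
        estimate = sum(cost(lam, i) for i in evals) / len(evals)   # ~ E_i[c(A(i; lambda))]
        if estimate < incumbent_cost:                              # keep track of the incumbent
            incumbent, incumbent_cost = lam, estimate
    return incumbent, incumbent_cost
```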
As RL is a dynamic optimization process, it can benefit from dynamic changes in the hyperparameter values such as learning rate schedules (Zhang et al., 2021a; Parker-Holder et al., 2022). Thus, HPO tools developed specifically for RL have been following the DAC paradigm in order to tailor the hyperparameter values more closely to the training progress (Franke et al., 2020; Zhang et al., 2021a; Wan et al., 2022).

It is worth noting that while the model architecture can be defined by a set of hyperparameters like the number of layers, architecture search is generally more complex and thus separately defined as the NAS problem or combined with HPO to form the general AutoDL problem (Zimmer et al., 2021). While some tools include options for optimizing architecture hyperparameters, insights into how to find good architectures for RL are out of scope for this paper.

3. Related Work

While RL as a field has seen many innovations in the last years, small changes to the algorithm or its implementation can have a big impact on its results (Henderson et al., 2018; Andrychowicz et al., 2021; Engstrom et al., 2020). In an effort to consolidate these innovations, several papers have examined the effect of smaller design decisions like the loss function or policy regularization for on-policy algorithms (Hsu et al., 2020; Andrychowicz et al., 2021), DQN (Obando-Ceron & Castro, 2021) and offline RL (Zhang & Jiang, 2021). AutoRL methods, on the other hand, have focused on automating and abstracting some of these decisions (Parker-Holder et al., 2022) by using data-driven approaches to learn various algorithmic components (Bechtle et al., 2020; Xu et al., 2020; Metz et al., 2022) or even entire RL algorithms (Wang et al., 2016; Duan et al., 2016; Co-Reyes et al., 2021; Lu et al., 2022).

While overall there has been less interest in hyperparameter optimization, some RL-specific HPO algorithms have been developed. STACX (Zahavy et al., 2020) is an example of a self-tuning algorithm, using meta-gradients (Xu et al., 2018) to optimize its hyperparameters during runtime. This idea has recently been generalized to bootstrapped meta-learning, enabling the use of meta-gradients to learn any combination of hyperparameters on most RL algorithms on the fly (Flennerhag et al., 2022). Such gradient-based approaches are fairly general and have shown a lot of promise (Paul et al., 2019). However, they require access to the algorithm's gradients, thus limiting their use and incurring a larger compute overhead. In this paper, we focus on purely black-box methods for their ease of use in any RL setting.

Extensions of population-based training (PBT) (Jaderberg et al., 2017; Li et al., 2019), such as BO kernels (Parker-Holder et al., 2020) or added NAS components (Franke et al., 2020; Wan et al., 2022), have led to significant performance and efficiency gains, offering an RL-specific way of optimizing hyperparameters during training. A benefit of PBT methods is that they implicitly find a schedule of hyperparameter settings instead of a fixed value.

Beyond PBT methods, many general AC algorithms have proven to perform well on ML and RL tasks (Schede et al., 2022). A few such examples are SMAC (Lindauer et al., 2022) and DEHB (Awad et al., 2021), which are based on Bayesian Optimization and evolutionary algorithms, respectively. SMAC is model-based (i.e. it learns a model of the hyperparameter landscape using a Gaussian process) and both are multi-fidelity methods (i.e. they utilize shorter training runs to test many different configurations, only progressing the best ones). While these algorithms have rarely been used in RL so far, there is no evidence to suggest they perform any worse than RL-specific optimization approaches. In fact, a possible advantage of multi-fidelity approaches over population-based ones is that given the same budget, multi-fidelity methods see a larger number of total configurations, while population-based ones see a smaller number of configurations trained for a longer time.

4. The Hyperparameter Landscape of RL

Before comparing HPO algorithms, we empirically motivate why using dedicated tuning tools is important in RL. To this end, we study the effect of hyperparameters as well as that of the random seed on the final performance of RL algorithms. We also investigate the smoothness of the hyperparameter space. The goal of this section is not to achieve the best possible results on each task but to gather insights into how hyperparameters affect RL algorithms and how we can optimize them effectively.

Experimental Setup To gain robust insights into the impact of hyperparameters on the performance of an RL agent, we consider a range of widely-used environments and algorithms. We use basic gym environments such as OpenAI's Pendulum and Acrobot (Brockman et al., 2016), gridworlds with an exploration component such as MiniGrid's Empty and DoorKey 5x5 (Chevalier-Boisvert et al., 2018), as well as robot locomotion tasks such as Brax's Ant, Halfcheetah and Humanoid (Freeman et al., 2021). We use PPO (Schulman et al., 2017) and DQN (Mnih et al., 2015) for the discrete environments, and PPO as well as SAC (Haarnoja et al., 2018) for the continuous ones, all in their StableBaselines3 implementations (Raffin et al., 2021). This selection is representative of the main classes of model-free RL algorithms (i.e. on-policy policy-optimization, off-policy value-based, and off-policy actor-critic) and covers a diverse set of tasks posing different challenges (i.e. discrete and continuous control), allowing us to draw meaningful and generalizable conclusions.
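As an illustration of this setup, the sketch below scores a single hyperparameter configuration with StableBaselines3 (here PPO on Pendulum) and returns the negative mean evaluation reward as a cost. The specific hyperparameter values, training budget and environment choice are illustrative assumptions, not the exact protocol used in our experiments.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def evaluate_config(learning_rate=3e-4, gamma=0.99, clip_range=0.2,
                    total_timesteps=50_000, seed=0):
    """Train PPO with one hyperparameter configuration and return its cost
    (negative mean evaluation reward over 10 episodes)."""
    model = PPO("MlpPolicy", "Pendulum-v1", learning_rate=learning_rate,
                gamma=gamma, clip_range=clip_range, seed=seed, verbose=0)
    model.learn(total_timesteps=total_timesteps)
    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return -mean_reward  # lower is better, matching the cost formulation in Section 2

if __name__ == "__main__":
    print(evaluate_config())
```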
Table 2: Tuning PPO on Acrobot (top) and SAC on Pendulum (bottom) across the full search space and different numbers of seeds. Lower numbers are better; the best test performance for each method and values within its standard deviation are highlighted. Test performances are aggregated across 10 separate test seeds using the mean for each tuning run. We report the mean and standard deviation of these.

                       DEHB Inc.       DEHB Test       PB2 Inc.        PB2 Test        RS Inc.         RS Test
Acrobot    1 Seed      70.6 ± 3.4      341.3 ± 183.1   305.3 ± 185.5   353.7 ± 134.5   77.8 ± 4.9      136.8 ± 70.5
           3 Seeds     76.2 ± 0.9      381.1 ± 127.6   301.2 ± 128.0   411.3 ± 117.9   88.2 ± 5.7      98.8 ± 16.3
           5 Seeds     79.3 ± 1.2      465.1 ± 24.6    228.5 ± 149.5   471.8 ± 19.1    89.2 ± 10.4     116.8 ± 43.3
           10 Seeds    156.0 ± 24.5    464.8 ± 36.5    404.9 ± 53.3    474.4 ± 23.5    108.3 ± 28.2    100.1 ± 20.0
Pendulum   1 Seed      111.5 ± 23.6    150.5 ± 13.4    77.8 ± 19.0     840.7 ± 580.1   88.6 ± 24.9     168.3 ± 46.4
           3 Seeds     125.0 ± 23.2    144.8 ± 9.0     133.3 ± 14.7    171.0 ± 35.5    150.7 ± 13.9    159.0 ± 21.6
           5 Seeds     127.3 ± 11.5    350.2 ± 418.2   134.0 ± 22.1    661.3 ± 586.2   134.8 ± 9.8     397.8 ± 485.5
           10 Seeds    742.4 ± 498.8   318.6 ± 281.3   282.0 ± 252.9   468.6 ± 437.9   144.5 ± 17.9    150.2 ± 4.8
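The aggregation protocol behind Table 2 can be summarized by the following sketch, where the helper `run_agent` (which trains one configuration on one seed and returns its final cost) is a hypothetical placeholder: during tuning, a configuration is scored by its mean cost over the tuning seeds, and the selected configuration is afterwards reported as mean and standard deviation over separate, unseen test seeds.

```python
import statistics

def tuning_score(run_agent, config, tuning_seeds):
    """Cost seen by the tuner: mean final cost across the tuning seeds."""
    return statistics.mean(run_agent(config, seed=s) for s in tuning_seeds)

def test_report(run_agent, config, test_seeds):
    """Reported numbers: mean and standard deviation across unseen test seeds."""
    costs = [run_agent(config, seed=s) for s in test_seeds]
    return statistics.mean(costs), statistics.stdev(costs)

# Example usage with disjoint seed sets, e.g. tuning_seeds=[0, 1, 2, 3, 4]
# and test_seeds=[5, 6, ..., 14]; the two sets must not overlap.
```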
vidually per hyperparameter as in Figure 4, we can see that the previously predictable behaviour is replaced with significant differences across seeds. We observe single seeds with crashing performance, inconsistent learning curves and also exceptionally well performing seeds that end up outperforming the best seeds of configurations which are better on average. Given that we believe tuning only a few seeds of the target RL algorithm is still the norm (Schulman et al., 2017; Berner et al., 2019; Raileanu & Rocktäschel, 2020; Badia et al., 2020; Hambro et al., 2022), such high variability with respect to the seed is likely a bigger difficulty factor for HPO in RL than the optimization landscape itself.

Thus, our conclusion is somewhat surprising: it should be possible to tune RL hyperparameters just as well as those in any other field without RL-specific additions to the tuning algorithm, since RL hyperparameter landscapes appear to be rather smooth. The large influence of many different hyperparameters is a potential obstacle, however, as are interaction effects that can occur between hyperparameters. Furthermore, RL's sensitivity to the random seed can present a challenge in tuning its hyperparameters, both by hand and in an automated manner.

4.3. How Do We Account for Noise?

As the variability between random seeds is a potential source of error when tuning and running RL algorithms, we investigate how we can account for it in our experiments to generate more reliable performance estimates.

As we have seen high variability both in performance and across seeds for different hyperparameter values, we return to Figure 2 to investigate how big the seed's influence on the final performance really is. The plots show that the standard deviation of the performance for the same hyperparameter configuration can be very large. While this performance spread tends to decrease for configurations with better median performance, top-performing seeds can stem from unstable configurations with low median performance (e.g. the learning rate on Humanoid). In most cases, there is an overlap between adjacent configurations, so it is certainly possible to select a presumably well-performing hyperparameter configuration on one seed that has low average performance across others.

As this is a known issue in other fields as well, albeit not to the same degree as in RL, it is common to evaluate a configuration on multiple seeds in order to achieve a more reliable estimate of the true performance (Eggensperger et al., 2018). We verify this for RL by comparing the final performance of agents tuned by DEHB and PB2 on the performance mean across a single, 3 or 5 seeds. We then test the overall best configuration on 5 unseen test seeds.

Table 2 shows that RS is able to improve the average test performance on both Acrobot and Pendulum by increasing the number of tuning seeds, as are DEHB and PB2 on Pendulum. However, this is only true up to a point, as performance estimation across more than 3 seeds leads to a general decrease in test performance, as well as a sharp increase in variance in some cases (e.g. 5 seed RS or 10 seed PB2 on Pendulum). Especially when tuning across 10 seeds, we see that the incumbents suffer as well, indicating that evaluating the configurations across multiple seeds increases the difficulty of the HPO problem substantially, even though it can help avoid overfitting. The performance difference between tuning and testing is significant in many cases, and we can see, e.g. on Acrobot, that the best incumbent configurations, found by DEHB, perform more than four times worse on test seeds. We can find this effect in all tuning methods, especially on Pendulum. This presents a challenge for reproducibility given that currently it is almost impossible to know what seeds were used for tuning or evaluation. Simply reporting the performance of tuned seeds for the proposed method and that of testing seeds for the baselines is an unfair comparison which can lead to wrong conclusions.

To summarize, we have seen that the main challenges are the size of the search space, the variability involved in training RL agents, and the challenging generalization across
random seeds. Since many hyperparameters have a large influence on agent performance, but the optimization landscape is relatively smooth, RL hyperparameters can be efficiently tuned using HPO techniques, as we have shown in our experiments. Manual tuning, however, is comparatively costly as its cost scales at least linearly with the size of the search space. Dedicated HPO tools, on the other hand, are able to find good configurations on a significantly smaller budget by searching the whole space. A major difficulty factor, however, is the high variability of results across seeds, which is an overlooked reproducibility issue that can lead to distorted comparisons of RL algorithms. This problem can be alleviated by tuning the algorithms on multiple seeds and evaluating them on separate test seeds.

Figure 5: Tuning Results for PPO on Brax. Shown is the mean evaluation reward across 10 episodes for 3 tuning runs as well as the 98% confidence interval across tuning runs.

5. Tradeoffs for Hyperparameter Optimization in Practice

While the experiments in the previous section are meant to highlight what challenges HPO tools face in RL and how well they overcome them, we now turn to more complex use cases of HPO. To this end, we select three challenging environments each from Brax (Freeman et al., 2021) (Ant, Halfcheetah and Humanoid) and three from Procgen (Cobbe et al., 2020) (Bigfish, Climber and Plunder) and automatically tune the state-of-the-art RL algorithms on these domains (PPO for Brax and IDAAC (Raileanu & Fergus, 2021) for Procgen). Our goal here is simple: we want to see if HPO tools can improve upon the state of the art in these domains with respect to both final performance and compute overhead. As we now want to compare absolute performance values on more complex problems with a bigger budget, we use BGT (Wan et al., 2022) as the state-of-the-art population-based approach, and DEHB since it is among the best solvers currently available (Eggensperger et al., 2021). As before, we use RS as an example of a simple-to-implement tuning algorithm with minimal overhead. In view of the results of Turner et al. (2021) on HPO for supervised machine learning, we expect that RS should be outperformed by the other approaches. For each task, we work on the original open-sourced code of each state-of-the-art RL method we test against, using the manually tuned hyperparameter settings recommended in the corresponding papers as the baseline. All tuning algorithms are given a small budget of up to 16 full algorithm runs as well as a larger one of 64 runs. In comparison, IDAAC's tuning budget is 810 runs. To give an idea of the reliability of both the tuning algorithm and the found configurations, we tune each setting 3 times across 5 seeds and test the best-found configuration on 10 unseen test seeds.

As shown in Figures 5 and 6, these domains are more challenging to tune on our small budgets relative to our previous environments (for tabular results, see Appendix F). While we do not know how the Brax baseline agent was tuned, as this is not reported in the paper, the IDAAC baseline uses 810 runs, which is 12 times more than the large tuning budget used by our HPO methods. On Brax, DEHB outperforms the baseline with a mean rank of 1.3 compared to 1.7 for the 16 run budget and a rank of 1.0 compared to the baseline's 1.3 with 64 runs. On Procgen the comparison is similar with 1.7 to 2 for 16 runs and 1.0 to 1.3 for 64 runs (see Appendix D.4 for details on how the rank is computed). We also see that DEHB's incumbent and test scores improve the most consistently out of all the tuning methods, with the additional run budget being utilized especially well on Brax. RS, as expected, cannot match this performance, ranking 2.3 and 2.7 for 16 runs and 3.3 and 3 for 64 runs, respectively. We also see poor scaling behavior in some cases, e.g. RS with a larger budget overfits to the tuning seeds on Brax while failing to improve on Procgen. As above, we see an instance of PB2 performing around 5 times worse on the test seeds compared to the incumbent on Bigfish, further suggesting that certain PBT variants may struggle to generalize in such settings. On the other environments it does better, however, earning a Procgen rank of 2 on the 16 run budget, matching the baseline. With a budget of 64 runs, it
ranks 2.7, the same as BGT and above RS. BGT does not overfit to the same degree as PB2 but performs worse on lower budgets, ranking 3.8 on Procgen for 16 runs and 2.7 for 64. On Brax, it fails to find good configurations with the exception of a single run on Ant (rank 3). We do not restart the BGT optimization after a set amount of failures, however, in order to keep within our small maximum budgets. The original paper indicates that it is likely BGT will perform much better given a less restrictive budget.

Overall, HPO tools conceived for the AC setting, as represented by DEHB, are the most consistent and reliable within our experimental setting. Random Search, while not a bad choice on smaller budgets, does not scale as well with the number of tuning runs. Population-based methods cannot match either: PB2, while finding very well performing incumbent configurations, struggles with overfitting, while BGT would likely benefit from larger budgets than used here. Further research into this optimization paradigm that prioritizes general configurations over incumbent performance could lead to additional improvements.

Across both benchmarks we see large discrepancies between the incumbent and test performance. This underlines our earlier point about the importance of using different test and tuning seeds for reporting. In terms of compute overhead, all tested HPO methods had negligible effects on the total runtime, with BGT, by far the most expensive one, utilising on average under two minutes to produce new configurations for the 16 run budget and less than 2 hours for the 64 run budget, with all other approaches staying under 5 minutes in each budget. Overall, we see that even computationally cheap methods with small tuning budgets can generally match or outperform painstakingly hand-tuned configurations that use orders of magnitude more compute.

6. Recommendations & Best Practices

Our experiments show the benefit of comprehensive hyperparameter tuning in terms of both final performance and compute cost, as well as how common overfitting to the set of tuning seeds is. As a result of our insights, we recommend some good practices for HPO in RL going forward.

Complete Reporting We still find that many RL papers do not state how they obtain their hyperparameter configurations, if they are included at all. As we have seen, however, unbiased comparisons should not take place on the same seeds the hyperparameters are tuned on. Hence, reporting the tuning seeds, the test seeds, and the exact protocol used for hyperparameter selection should be standard practice to ensure a sound comparison across RL methods.

Adopting AutoML Standards In many ways, the AutoML community is ahead of the RL community regarding hyperparameter tuning. We can leverage this by learning from their best practices, as e.g. stated by Eggensperger et al. (2019) and Lindauer & Hutter (2020), and using their HPO tools, which can lead to strong performance as shown in this paper. One notable good practice is to use separate seeds for tuning and testing hyperparameter configurations. Other examples include standardizing the tuning budget for the baselines and proposed methods, as well as tuning on the training and not the test setting. While HPO in RL provides unique challenges such as the dynamic nature of the training loop or the strong sensitivity to the random seed, we observe significant improvements in both final performance and compute cost by employing state-of-the-art AutoML approaches. This can be done by integrating multi-fidelity evaluations into the population-based framework or using optimization tools like DEHB and SMAC.

Integrate Tuning Into The Development Pipeline For fair comparisons and realistic views of RL methods, we have to use competently tuned baselines. More specifically, the proposed method and baselines should use the same tuning budget and be evaluated on test seeds which should be different from the tuning seeds. Integrating HPO into RL codebases is a major step towards facilitating such comparisons. Some RL frameworks have started to include options for automated HPO (Huang et al., 2021; Liaw et al., 2018) or provide recommended hyperparameters for a set of environments (Raffin et al., 2021) (although usually not how they were obtained). The choice of tuning tools for each library is still relatively limited, however, while provided hyperparameters are not always well documented and typically do not transfer well to other environments or algorithms. Thus, we hope our versatile and easy-to-use HPO implementations that can be applied to any RL algorithm and environment will encourage broader use of HPO in RL (see Appendix B for more information). In the future, we hope more RL libraries include AutoRL approaches since in a closed ecosystem, more sophisticated methods that go beyond black-box optimizers (e.g. gradient-based methods, neuro-evolution, or meta-learned hyperparameter agents à la DAC) could be deployed more easily.

Figure 6: Tuning Results for IDAAC on Procgen. Shown is the mean evaluation reward across 10 episodes for 3 tuning runs as well as the 98% confidence interval across tuning runs.

A Recipe For Efficient RL Research To summarize, we recommend the following step-by-step process for tuning and selecting hyperparameters in RL:

1. Define a training and test set which can include:
   (a) environment variations
   (b) random seeds for non-deterministic environments
   (c) random seeds for initial state distributions
   (d) random seeds for the agent (including network initialization)
   (e) training random seeds for the HPO tool
2. Define a configuration space with all hyperparameters that likely contribute to training success;
3. Decide which HPO method to use;
4. Define the limitations of the HPO method, i.e. the budget (or use a self-terminating method (Makarova et al., 2022));
5. Settle on a cost metric: this should be an evaluation reward across as many episodes as needed for a reliable performance estimate;
6. Run this HPO method on the training set across a number of tuning seeds;
7. Evaluate the resulting incumbent configurations on the test set across a number of separate test seeds and report the results.

To ensure a fair comparison, this procedure should be followed for all RL methods used, including the baselines. If existing hyperparameters are re-used, their source and tuning protocol should be reported. In addition, their corresponding budget and search space should be the same as those of the other RL methods used for comparison. In case the budget is runtime and not e.g. a number of environment steps, it is also important to use comparable hardware for all runs. Furthermore, it is important to use the same test seeds for all configurations, and these should be separate from all tuning seeds. If this information is not available, re-tuning the algorithm is preferred. This procedure, including all information on the search space, cost metric, HPO method settings, seeds and final hyperparameters, should be reported. We provide a checklist containing all of these points in Appendix A and as a LaTeX template in our GitHub repository.

7. Conclusion

We showed that hyperparameters in RL deserve more attention from the research community than they currently receive. Underreported tuning practices have the potential to distort algorithm evaluations while ignored hyperparameters may lead to suboptimal performance. With only small budgets, we demonstrate that HPO tools like DEHB can cover large search spaces to produce better performing configurations using fewer computational resources than hyperparameter sweeps or grid searches. We provide versatile and easy-to-use implementations of these tools which can be applied to any RL algorithm and environment. We hope this will encourage the adoption of AutoML best practices by the RL community, which should enhance the reproducibility of RL results and make solving new domains simpler.

Nevertheless, there is a lot of potential for developing HPO approaches tailored to the key challenges of RL such as the high sensitivity to the random seed for a given hyperparameter configuration. Frameworks for learnt hyperparameter policies or gradient-based optimization methods could counteract this effect by reacting dynamically to an algorithm's behaviour on a given seed. We believe this is a promising direction for future work since in our experiments, PBT methods yield fairly static configurations instead of flexible schedules. Benchmarks like the recent AutoRL-Bench (Shala et al., 2022) accelerate progress by comparing AutoRL tools without the need for RL algorithm evaluations. Lastly, higher-level AutoRL approaches that do not aim to find hyperparameter values but replace them entirely by directing the algorithm's behavior could in the long term both simplify and stabilize RL algorithms. Examples include exploration strategies (Zhang et al., 2021b), learnt optimizers (Metz et al., 2022) or entirely new algorithms (Co-Reyes et al., 2021; Lu et al., 2022).

Acknowledgements

Marius Lindauer acknowledges funding by the European Union (ERC, "ixAutoML", grant no. 101041029). Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
References

Adriaensen, S., Biedenkapp, A., Shala, G., Awad, N., Eimer, T., Lindauer, M., and Hutter, F. Automated dynamic algorithm configuration. Journal of Artificial Intelligence Research, 2022.
Agarwal, R., Schwarzer, M., Castro, P., Courville, A., and Bellemare, M. Deep reinforcement learning at the edge of the statistical precipice. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS, pp. 29304–29320, 2021.
Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
Andrychowicz, M., Raichuk, A., Stanczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., and Bachem, O. What matters for on-policy deep actor-critic methods? A large-scale study. In 9th International Conference on Learning Representations, ICLR. OpenReview.net, 2021.
Awad, N., Mallik, N., and Hutter, F. DEHB: Evolutionary hyperband for scalable, robust and efficient hyperparameter optimization. In Zhou, Z. (ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI, pp. 2147–2153. ijcai.org, 2021.
Badia, A., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z., and Blundell, C. Agent57: Outperforming the Atari human benchmark. In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119 of Proceedings of Machine Learning Research, pp. 507–517. PMLR, 2020.
Bakshy, E., Dworkin, L., Karrer, B., Kashin, K., Letham, B., Murthy, A., and Singh, S. Ae: A domain-agnostic platform for adaptive experimentation. 2018.
Bechtle, S., Molchanov, A., Chebotar, Y., Grefenstette, E., Righetti, L., Sukhatme, G., and Meier, F. Meta learning via learned loss. In 25th International Conference on Pattern Recognition, ICPR, pp. 4161–4168. IEEE, 2020.
Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. 13:281–305, 2012.
Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., and Zhang, S. Dota 2 with large scale deep reinforcement learning. CoRR, abs/1912.06680, 2019.
Biedenkapp, A., Bozkurt, H. F., Eimer, T., Hutter, F., and Lindauer, M. Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework. In Lang, J., Giacomo, G. D., Dilkina, B., and Milano, M. (eds.), Proceedings of the Twenty-fourth European Conference on Artificial Intelligence (ECAI'20), pp. 427–434, June 2020.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/Minigrid.
Co-Reyes, J., Miao, Y., Peng, D., Real, E., Le, Q., Levine, S., Lee, H., and Faust, A. Evolving reinforcement learning algorithms. In 9th International Conference on Learning Representations, ICLR. OpenReview.net, 2021. URL https://openreview.net/forum?id=0XXpJ4OtjW.
Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Leveraging procedural generation to benchmark reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, ICML, volume 119 of Proceedings of Machine Learning Research, pp. 2048–2056. PMLR, 2020.
Duan, Y., Schulman, J., Chen, X., Bartlett, P., Sutskever, I., and Abbeel, P. RL$^2$: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016.
Eggensperger, K., Lindauer, M., and Hutter, F. Neural networks for predicting algorithm runtime distributions. In Lang, J. (ed.), Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp. 1442–1448. ijcai.org, 2018.
Eggensperger, K., Lindauer, M., and Hutter, F. Pitfalls and best practices in algorithm configuration. pp. 861–893, 2019.
Eggensperger, K., Müller, P., Mallik, N., Feurer, M., Sass, R., Klein, A., Awad, N., Lindauer, M., and Hutter, F. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks, 2021.
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep RL: A case study on PPO and TRPO. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.
Flennerhag, S., Schroecker, Y., Zahavy, T., van Hasselt, H., Silver, D., and Singh, S. Bootstrapped meta-learning. In The Tenth International Conference on Learning Representations, ICLR. OpenReview.net, 2022.
Franke, J., Köhler, G., Biedenkapp, A., and Hutter, F. Sample-efficient automated deep reinforcement learning. In 9th International Conference on Learning Representations, ICLR. OpenReview.net, 2021.
Franke, J. K., Köhler, G., Biedenkapp, A., and Hutter, F. Sample-efficient automated deep reinforcement learning. arXiv:2009.01555 [cs.LG], 2020.
Freeman, C., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. Brax - A differentiable physics engine for large scale rigid body simulation. In Vanschoren, J. and Yeung, S. (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, 2021.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML, volume 80 of Proceedings of Machine Learning Research, pp. 1856–1865. PMLR, 2018.
Hambro, E., Raileanu, R., Rothermel, D., Mella, V., Rocktäschel, T., Küttler, H., and Murray, N. Dungeons and data: A large-scale NetHack dataset. 2022.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In McIlraith, S. and Weinberger, K. (eds.), Proceedings of the Conference on Artificial Intelligence (AAAI'18). AAAI Press, 2018.
Hsu, C., Mendler-Dünner, C., and Hardt, M. Revisiting design choices in proximal policy optimization. CoRR, abs/2009.10897, 2020.
Huang, S., Dossa, R., Ye, C., and Braga, J. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. CoRR, abs/2111.08819, 2021.
Hutter, F., Hoos, H., and Leyton-Brown, K. An efficient approach for assessing hyperparameter importance. In Xing, E. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning (ICML'14), pp. 754–762. Omnipress, 2014.
Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., and Kavukcuoglu, K. Population based training of neural networks. arXiv:1711.09846 [cs.LG], 2017.
Kiran, M. and Ozyildirim, B. Hyperparameter tuning for deep reinforcement learning applications. CoRR, abs/2201.11182, 2022. URL https://arxiv.org/abs/2201.11182.
Li, A., Spyra, O., Perel, S., Dalibard, V., Jaderberg, M., Gu, C., Budden, D., Harley, T., and Gupta, P. A generalized framework for population based training. In Teredesai, A., Kumar, V., Li, Y., Rosales, R., Terzi, E., and Karypis, G. (eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD, pp. 1791–1799. ACM, 2019.
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. 18(185):1–52, 2018.
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., and Stoica, I. Tune: A research platform for distributed model selection and training. CoRR, abs/1807.05118, 2018.
Lindauer, M. and Hutter, F. Best practices for scientific research on neural architecture search. Journal of Machine Learning Research, 21:1–18, 2020.
Lindauer, M., Eggensperger, K., Feurer, M., Biedenkapp, A., Deng, D., Benjamins, C., Ruhkopf, T., Sass, R., and Hutter, F. SMAC3: A versatile Bayesian optimization package for hyperparameter optimization. J. Mach. Learn. Res., 23:54:1–54:9, 2022.
Lu, C., Kuba, J., Letcher, A., Metz, L., de Witt, C., and Foerster, J. Discovered policy optimisation. CoRR, abs/2210.05639, 2022.
Makarova, A., Shen, H., Perrone, V., Klein, A., Faddoul, J., Krause, A., Seeger, M., and Archambeau, C. Automatic termination for hyperparameter optimization. In Guyon, I., Lindauer, M., van der Schaar, M., Hutter, F., and Garnett, R. (eds.), International Conference on Automated Machine Learning, AutoML, volume 188 of Proceedings of Machine Learning Research, pp. 7/1–21. PMLR, 2022.
Metz, L., Harrison, J., Freeman, C., Merchant, A., Beyer, L., Bradbury, J., Agrawal, N., Poole, B., Mordatch, I., Roberts, A., and Sohl-Dickstein, J. VeLO: Training versatile learned optimizers by scaling up. CoRR, abs/2211.09760, 2022.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Obando-Ceron, J. and Castro, P. Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML, volume 139 of Proceedings of Machine Learning Research, pp. 1373–1383. PMLR, 2021.
Parker-Holder, J., Nguyen, V., and Roberts, S. Provably efficient online hyperparameter optimization with population-based bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.
Parker-Holder, J., Rajan, R., Song, X., Biedenkapp, A., Miao, Y., Eimer, T., Zhang, B., Nguyen, V., Calandra, R., Faust, A., Hutter, F., and Lindauer, M. Automated reinforcement learning (AutoRL): A survey and open problems. J. Artif. Intell. Res., 74:517–568, 2022.
Paul, S., Kurin, V., and Whiteson, S. Fast efficient hyperparameter tuning for policy gradient methods. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pp. 4618–4628, 2019.
Pushak, Y. and Hoos, H. H. AutoML loss landscapes. ACM Trans. Evol. Learn. Optim., 2(3):10:1–10:30, 2022.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. J. Mach. Learn. Res., 22:268:1–268:8, 2021.
Raileanu, R. and Fergus, R. Decoupling value and policy for generalization in reinforcement learning. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML, volume 139 of Proceedings of Machine Learning Research, pp. 8787–8798. PMLR, 2021.
Raileanu, R. and Rocktäschel, T. RIDE: Rewarding impact-driven exploration for procedurally-generated environments. In 8th International Conference on Learning Representations, ICLR. OpenReview.net, 2020.
Sass, R., Bergman, E., Biedenkapp, A., Hutter, F., and Lindauer, M. DeepCAVE: An interactive analysis tool for automated machine learning. CoRR, abs/2206.03493, 2022.
Schede, E., Brandt, J., Tornede, A., Wever, M., Bengs, V., Hüllermeier, E., and Tierney, K. A survey of methods for automated algorithm configuration. J. Artif. Intell. Res., 75:425–487, 2022.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG], 2017.
Shala, G., Arango, S., Biedenkapp, A., Hutter, F., and Grabocka, J. AutoRL-Bench 1.0. In Workshop on Meta-Learning (MetaLearn@NeurIPS'22), 2022.
Storn, R. and Price, K. Differential evolution - A simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim., 11(4):341–359, 1997.
Turner, R., Eriksson, D., McCourt, M., Kiili, J., Laaksonen, E., Xu, Z., and Guyon, I. Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge 2020. CoRR, abs/2104.10201, 2021.
Wan, X., Lu, C., Parker-Holder, J., Ball, P., Nguyen, V., Ru, B., and Osborne, M. Bayesian generational population-based training. In Guyon, I., Lindauer, M., van der Schaar, M., Hutter, F., and Garnett, R. (eds.), International Conference on Automated Machine Learning, AutoML, volume 188 of Proceedings of Machine Learning Research, pp. 14/1–27. PMLR, 2022.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., and de Freitas, N. Dueling network architectures for deep reinforcement learning. In Balcan, M. and Weinberger, K. (eds.), Proceedings of the 33rd International Conference on Machine Learning (ICML'17), volume 48, pp. 1995–2003. Proceedings of Machine Learning Research, 2016.
Xu, Z., van Hasselt, H., and Silver, D. Meta-gradient reinforcement learning. In Bengio, S., Wallach, H. M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS, pp. 2402–2413, 2018.
Xu, Z., van Hasselt, H., Hessel, M., Oh, J., Singh, S., and Silver, D. Meta-gradient reinforcement learning with an objective discovered online. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual
1. Are there training and test settings available on your chosen domains?
If yes:
• Is only the training setting used for training? ✓✗
• Is only the training setting used for tuning? ✓✗
• Are final results reported on the test setting? ✓✗
2. Hyperparameters were tuned using <package-name> which is based on <an-optimization-method>
3. The configuration space was: <algorithm-1>:
• <a-continuous-hyperparameter>: (<lower>, <upper>)
• <a-logspaced-continuous-hyperparameter>: log((<lower>, <upper>))
• <a-discrete-hyperparameter>: [<lower>, <upper>]
• <a-categorical-hyperparameter>: <choice-a>, <choice-b>
• ...
<algorithm-2>:
• <an-additional-hyperparameter>: (<lower>, <upper>)
• ...
4. The search space contains the same hyperparameters and search ranges wherever algorithms share hyperparameters ✓✗
If no, why not?
5. The cost metric(s) optimized was/were <a-cost-metric>
6. The tuning budget was <the-budget>
7. The tuning budget was the same for all tuned methods ✓✗
If no, why not?
8. If the budget is given in time: the hardware used for all tuning runs is comparable ✓✗
9. All methods that were reported were tuned with the methods and settings described above ✓✗
If no, why not?
10. Tuning was done across <n> tuning seeds which were: [<0>, <1>, <2>, <3>, <4>]
11. Testing was done across <m> test seeds which were: [<5>, <6>, <7>, <8>, <9>]
12. Are all results reported on the test seeds? ✓✗
If no, why not?
13. The final incumbent configurations reported were:
<algorithm-1-env-1>:
• <a-hyperparameter>: <value>
• ...
<algorithm-1-env-2>:
• <a-hyperparameter>: <value>
• ...
<algorithm-2-env-1>:
• <a-hyperparameter>: <value>
• ...
14. The code for reproducing these experiments is available at: <a-link>
15. The code also includes the tuning process ✓✗
16. Bundled with the code is an exact version of the original software environment, e.g. a conda environment file with all
package versions or a docker image in case some dependencies are not conda installable ✓✗
17. The following hardware was used in running the experiments:
• <n> <gpu-types>
• ...
Figure 7: A base Hydra configuration file (left) and the changes necessary to tune this algorithm with DEHB (right).
Figure 8: Example definition of a search space for PPO in a separate configuration file.
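Since the figures themselves are not reproduced here, the sketch below illustrates the kind of training script the Hydra sweepers expect; the config path, config name and fields such as `cfg.env_id` or `cfg.learning_rate` are illustrative assumptions rather than the paper's actual files. The essential point is that the training function simply returns its evaluation metric, which the sweeper then optimizes.

```python
# Minimal sketch of a Hydra-tunable training script; all config fields and file
# names are illustrative assumptions, not the paper's actual configuration.
import hydra
from omegaconf import DictConfig
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

@hydra.main(config_path="configs", config_name="ppo_pendulum", version_base=None)
def train(cfg: DictConfig) -> float:
    model = PPO("MlpPolicy", cfg.env_id, learning_rate=cfg.learning_rate,
                gamma=cfg.gamma, seed=cfg.seed)
    model.learn(total_timesteps=cfg.total_timesteps)
    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward  # the sweeper reads this returned metric

if __name__ == "__main__":
    train()
```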
C.2. DEHB

DEHB is the combination of the evolutionary algorithm Differential Evolution (DE) (Storn & Price, 1997) and the multi-fidelity method HyperBand (Li et al., 2018). HyperBand as a multi-fidelity method is based on the idea of running many configurations with a small budget, i.e. only a fraction of training steps, and progressing promising ones to the next higher budget level. In this way we see many datapoints, but avoid spending time on bad configurations. DEHB starts with a full set of HyperBand budgets, from very low to full budget, and runs it in its first iteration. For each budget, DEHB runs the equivalent of one full algorithm run in steps, e.g. if the current budget is 1/10 of the full run budget, 10 configurations will be evaluated. For the second iteration, the lowest budget is left out and the second lowest is initialised with a population of configurations evolved by DEHB from the previous iteration's results. This procedure continues until either a maximum number of iterations is reached or only the full run budget is left. The number of budgets is decided by a hyperparameter η.

In our experiments in Section 4 we run 3 iterations with η = 5, so only 3 budgets, and in our larger DEHB experiments in Section 5 we use 2 iterations with η = 1.9, so 8 budgets. We set the minimum budget as 1/100 of the full run training steps in each case.
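The number of budget levels follows from η and the minimum budget. A small sketch of the standard geometric HyperBand spacing (an illustration only, not the DEHB implementation) reproduces the settings above: with a minimum budget of 1/100 of a full run, η = 5 yields 3 levels and η = 1.9 yields 8 levels.

```python
import math

def hyperband_budgets(max_budget, min_budget, eta):
    """Geometrically spaced budget levels: max_budget * eta**(-s), ..., max_budget."""
    s_max = int(math.floor(math.log(max_budget / min_budget) / math.log(eta)))
    return [max_budget * eta ** (-s) for s in range(s_max, -1, -1)]

print(len(hyperband_budgets(1.0, 0.01, eta=5)))    # 3 budget levels (Section 4)
print(len(hyperband_budgets(1.0, 0.01, eta=1.9)))  # 8 budget levels (Section 5)
```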
Table 3: StableBaselines hyperparameter defaults for different environments.

  train freq                 4          4
  gradient steps             1          1
  exploration fraction       0.1        0.1
  exploration initial eps    1.0        1.0
  exploration final eps      0.05       0.05
  buffer size                1000000    1000000
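For reference, the defaults listed in Table 3 map directly onto keyword arguments of the StableBaselines3 DQN constructor, as in the sketch below; the keyword names are StableBaselines3 arguments, while the choice of environment is only illustrative.

```python
from stable_baselines3 import DQN

# Values taken from Table 3; the environment is an illustrative choice.
model = DQN(
    "MlpPolicy",
    "Acrobot-v1",
    train_freq=4,
    gradient_steps=1,
    exploration_fraction=0.1,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    buffer_size=1_000_000,
)
```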
D.6. Hardware
All of our experiments were run on a compute cluster with two Intel CPUs per node (these were used for the experiments in Section 4) and four different GPU node configurations (used for the experiments in Section 5). These configurations are: 2 Pascal 130 GPUs, 2 Pascal 144 GPUs, 8 Volta 10 GPUs or 8 Volta 332 GPUs. We ran the CPU experiments with 10GB of memory on single nodes and the GPU experiments with 10GB for Procgen and 40GB for Brax on a single GPU each.
Table 8: Tuning PPO on Brax’s Ant, Halfcheetah and Humanoid environments. Shown are tuning results across 3 runs across 5 seeds
each, tested on 10 different test seeds.
Table 9: Tuning IDAAC on Procgen’s Bigfish, Climber and Plunder. Results are across 3 runs using 5 seeds each, and tested on 10
different test seeds.
Figure 17: Final returns across 5 seeds for different hp variations of SAC on Pendulum and Ant.
Figure 18: Final returns across 5 seeds for different hp variations of SAC on Halfcheetah and Humanoid.
Figure 19: Final returns across 5 seeds for different hp variations of DQN on Acrobot.
Figure 20: Final returns across 5 seeds for different hp variations of DQN on MiniGrid.
Figure 21: Final returns across 5 seeds for different hp variations of PPO on Pendulum and Acrobot.
Figure 22: Final returns across 5 seeds for different hp variations of PPO on Ant and Halfcheetah.
Figure 23: Final returns across 5 seeds for different hp variations of PPO on Humanoid.
Figure 24: Final returns across 5 seeds for different hp variations of PPO on MiniGrid.
Figure 25: PPO Hyperparameter Importances on Acrobot (left) and Pendulum (right).
Figure 26: PPO Hyperparameter Importances on MiniGrid Empty (left) and MiniGrid DoorKey (right).
Figure 27: PPO Hyperparameter Importances on Brax Ant (left), Halfcheetah (middle) and Humanoid (right).
Figure 28: DQN Hyperparameter Importances on Acrobot (left), MiniGrid Empty (middle) and MiniGrid DoorKey (right).
Figure 29: SAC Hyperparameter Importances on Pendulum (left) and Brax Ant (right).
Figure 30: SAC Hyperparameter Importances on Brax Halfcheetah (left) and Humanoid (right).