 1  # slm_lab/spec/benchmark/a3c/a3c_nstep_pong.json
 2
 3  {
 4    "a3c_nstep_pong": {
 5      "agent": [{
6 "name": "A3C",
7 "algorithm": {
8 "name": "ActorCritic",
9 "action_pdtype": "default",
10 "action_policy": "default",
11 "explore_var_spec": null,
12 "gamma": 0.99,
13 "lam": null,
14 "num_step_returns": 5,
15 "entropy_coef_spec": {
16 "name": "no_decay",
17 "start_val": 0.01,
18 "end_val": 0.01,
19 "start_step": 0,
20 "end_step": 0
21 },
22 "val_loss_coef": 0.5,
23 "training_frequency": 5
24 },
25 "memory": {
26 "name": "OnPolicyBatchReplay",
27 },
28 "net": {
29 "type": "ConvNet",
30 "shared": true,
31 "conv_hid_layers": [
32 [32, 8, 4, 0, 1],
33 [64, 4, 2, 0, 1],
34 [32, 3, 1, 0, 1]
35 ],
36 "fc_hid_layers": [512],
37 "hid_layers_activation": "relu",
38 "init_fn": "orthogonal_",
39 "normalize": true,
40 "batch_norm": false,
41 "clip_grad_val": 0.5,
42 "use_same_optim": false,
43 "loss_spec": {
WOW! eBook
www.wowebook.org
44 "name": "MSELoss"
45 },
46 "actor_optim_spec": {
47 "name": "GlobalAdam",
48 "lr": 1e-4
49 },
50 "critic_optim_spec": {
51 "name": "GlobalAdam",
52 "lr": 1e-4
53 },
54 "lr_scheduler_spec": null,
55 "gpu": false
56 }
57 }],
58 "env": [{
59 "name": "PongNoFrameskip-v4",
60 "frame_op": "concat",
61 "frame_op_len": 4,
62 "reward_scale": "sign",
63 "num_envs": 8,
64 "max_t": null,
65 "max_frame": 1e7
66 }],
67 "body": {
68 "product": "outer",
69 "num": 1
70 },
71 "meta": {
72 "distributed": "synced",
73 "log_frequency": 10000,
74 "eval_frequency": 10000,
75 "max_session": 16,
76 "max_trial": 1,
77 }
78 }
79 }
In Code 8.2 we set the meta spec "distributed": "synced" (line:72) and specify the number of workers by setting max_session to 16 (line:75). The optimizer is changed to GlobalAdam (line:47), a variant that is more suitable for Hogwild! We also set the number of environments num_envs to 8 (line:63). Note that if the number of environments is greater than 1, the algorithm becomes a hybrid of synchronous (vector environments) and asynchronous (Hogwild!) methods, and there will be num_envs × max_session workers. Conceptually, this can be thought of as a hierarchy of Hogwild! workers, each of which spawns a number of synchronous workers.
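To make this hierarchy concrete, here is a minimal sketch, assuming PyTorch and torch.multiprocessing, of 16 Hogwild!-style worker processes that share one global network while each steps its own batch of 8 environments synchronously. The toy environment, network, and loss are placeholders for illustration; this is not SLM Lab's implementation.

    # Minimal sketch of the sync/async hybrid: MAX_SESSION Hogwild! processes,
    # each stepping NUM_ENVS toy environments in lockstep. Illustrative only;
    # the environment, network, and loss are placeholders, not SLM Lab code.
    import torch
    import torch.multiprocessing as mp
    import torch.nn as nn

    NUM_ENVS = 8      # synchronous envs per worker ("num_envs" in the spec)
    MAX_SESSION = 16  # asynchronous Hogwild! workers ("max_session" in the spec)
    # Total parallel environments: NUM_ENVS * MAX_SESSION = 128

    class ToyVectorEnv:
        """Steps a batch of toy environments together (the synchronous part)."""
        def __init__(self, num_envs, obs_dim=4):
            self.num_envs, self.obs_dim = num_envs, obs_dim

        def step(self):
            # All envs advance in lockstep and return a batch of observations
            return torch.randn(self.num_envs, self.obs_dim)

    def worker(global_net):
        """One Hogwild! worker: collects a batch from its vector env and
        updates the shared parameters lock-free (the asynchronous part)."""
        envs = ToyVectorEnv(NUM_ENVS)
        optim = torch.optim.Adam(global_net.parameters(), lr=1e-4)
        for _ in range(100):
            obs = envs.step()
            loss = global_net(obs).pow(2).mean()  # placeholder loss
            optim.zero_grad()
            loss.backward()
            optim.step()  # writes directly into the shared global parameters

    if __name__ == '__main__':
        global_net = nn.Linear(4, 2)
        global_net.share_memory()  # put parameters in shared memory for Hogwild!
        workers = [mp.Process(target=worker, args=(global_net,))
                   for _ in range(MAX_SESSION)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()

In this sketch each worker keeps its own Adam state; an optimizer suited to Hogwild! such as GlobalAdam additionally places the optimizer statistics in shared memory so that all workers update them jointly.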
To train this A3C agent with n-step returns using SLM Lab, run
the commands shown in Code 8.3 in a terminal.
Figure 8.1: The trial graphs of A3C (n-step returns) with 16 workers. Since sessions take on the role of workers, the horizontal axis measures the number of frames experienced by an individual worker. Therefore, the total number of frames experienced collectively is equal to the sum of the individual frames, which adds up to 10 million frames in total.
8.4 SUMMARY
In this chapter, we discussed two widely applicable
parallelization methods — synchronous and asynchronous.
We showed that they can be implemented using vector environments and the Hogwild! algorithm, respectively.
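As a reminder of what the synchronous method boils down to, below is a minimal sketch of a vector environment, assuming the classic Gym step() API that returns (observation, reward, done, info); it is a simplified illustration rather than SLM Lab's or Gym's actual implementation.

    # Minimal synchronous vector environment (illustrative sketch, not SLM Lab code).
    import numpy as np

    class SyncVectorEnv:
        """Wraps a list of Gym-style environments and steps them in lockstep,
        so the agent receives batched observations, rewards, and done flags."""
        def __init__(self, env_fns):
            # env_fns: zero-argument callables, each creating one environment
            self.envs = [fn() for fn in env_fns]

        def reset(self):
            return np.stack([env.reset() for env in self.envs])

        def step(self, actions):
            obs_batch, reward_batch, done_batch = [], [], []
            for env, action in zip(self.envs, actions):
                obs, reward, done, _info = env.step(action)
                if done:
                    obs = env.reset()  # auto-reset so the batch stays in lockstep
                obs_batch.append(obs)
                reward_batch.append(reward)
                done_batch.append(done)
            return np.stack(obs_batch), np.array(reward_batch), np.array(done_batch)

A subprocess-based variant runs each environment in its own process to overlap the step() calls, but every batched step still waits for the slowest environment, which is the synchronization barrier discussed below.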
The two benefits of parallelization are faster training and more
diverse data. The second benefit plays a crucial role in
stabilizing and improving the training of policy gradient
algorithms. In fact, it often makes the difference between
success and failure.
When determining which of the parallelization methods to
apply, it helps to consider the following factors: ease of implementation, compute cost, and scale.
Synchronous methods (e.g. vector environments) are often
straightforward and easier to implement than asynchronous
methods, particularly if only data gathering is parallelized. Data generation is usually the cheaper part of the workload, so they require fewer resources for the same number of frames and scale well up to a moderate number of workers, e.g. fewer than 100. However, the
synchronization barrier becomes a bottleneck when applied at a
larger scale. In this case, asynchronous methods will likely be
significantly faster.
It is not always necessary to parallelize. As a general rule, try to
understand if a problem is simple enough to be solved without
parallelization before investing time and resources to
implement it. Additionally, the need to parallelize depends on
the algorithm used. Off-policy algorithms such as DQN can
often achieve very strong performance without parallelization
since the experience replay already provides diverse training
data. Even if training takes a very long time, agents can still