
5 "agent": [{

6 "name": "A3C",
7 "algorithm": {
8 "name": "ActorCritic",
9 "action_pdtype": "default",
10 "action_policy": "default",
11 "explore_var_spec": null,
12 "gamma": 0.99,
13 "lam": null,
14 "num_step_returns": 5,
15 "entropy_coef_spec": {
16 "name": "no_decay",
17 "start_val": 0.01,
18 "end_val": 0.01,
19 "start_step": 0,
20 "end_step": 0
21 },
22 "val_loss_coef": 0.5,
23 "training_frequency": 5
24 },
25 "memory": {
26 "name": "OnPolicyBatchReplay",
27 },
28 "net": {
29 "type": "ConvNet",
30 "shared": true,
31 "conv_hid_layers": [
32 [32, 8, 4, 0, 1],
33 [64, 4, 2, 0, 1],
34 [32, 3, 1, 0, 1]
35 ],
36 "fc_hid_layers": [512],
37 "hid_layers_activation": "relu",
38 "init_fn": "orthogonal_",
39 "normalize": true,
40 "batch_norm": false,
41 "clip_grad_val": 0.5,
42 "use_same_optim": false,
43 "loss_spec": {

44 "name": "MSELoss"
45 },
46 "actor_optim_spec": {
47 "name": "GlobalAdam",
48 "lr": 1e-4
49 },
50 "critic_optim_spec": {
51 "name": "GlobalAdam",
52 "lr": 1e-4
53 },
54 "lr_scheduler_spec": null,
55 "gpu": false
56 }
57 }],
58 "env": [{
59 "name": "PongNoFrameskip-v4",
60 "frame_op": "concat",
61 "frame_op_len": 4,
62 "reward_scale": "sign",
63 "num_envs": 8,
64 "max_t": null,
65 "max_frame": 1e7
66 }],
67 "body": {
68 "product": "outer",
69 "num": 1
70 },
71 "meta": {
72 "distributed": "synced",
73 "log_frequency": 10000,
74 "eval_frequency": 10000,
75 "max_session": 16,
76 "max_trial": 1,
77 }
78 }
79 }

Code 8.2: A3C spec file to play Atari Pong.

In Code 8.2, the meta spec sets "distributed": "synced" (line:72) and specifies the number of workers by setting max_session to 16 (line:75). The optimizer is changed to the GlobalAdam variant (line:47), which is more suitable for Hogwild!. We also set the number of environments num_envs to 8 (line:63). Note that if the number of environments is greater than 1, the algorithm becomes a hybrid of synchronous (vector environments) and asynchronous (Hogwild!) methods, and there will be num_envs × max_session workers. Conceptually, this can be thought of as a hierarchy of Hogwild! workers, each of which spawns a number of synchronous workers.
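The following is a minimal, self-contained sketch of this idea in plain PyTorch; it is not SLM Lab's implementation. The worker function, the placeholder loss, and the use of a per-worker Adam optimizer (instead of the shared-statistics GlobalAdam) are simplifying assumptions for illustration only.

import torch
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, shared_net, num_envs, steps):
    # Each worker optimizes the *shared* parameters directly (Hogwild!).
    # SLM Lab's GlobalAdam also shares the optimizer statistics across
    # workers; a per-worker Adam is used here only to keep the sketch short.
    torch.manual_seed(rank)  # give each worker a different data stream
    optim = torch.optim.Adam(shared_net.parameters(), lr=1e-4)
    for _ in range(steps):
        # Stand-in for a batch of observations from num_envs vector environments.
        states = torch.randn(num_envs, 4)
        loss = shared_net(states).pow(2).mean()  # placeholder loss
        optim.zero_grad()
        loss.backward()
        optim.step()  # lock-free write into the shared parameters

if __name__ == "__main__":
    max_session, num_envs = 16, 8  # as in the spec: 16 x 8 = 128 workers in total
    net = nn.Linear(4, 1)
    net.share_memory()             # place the parameters in shared memory
    procs = [mp.Process(target=worker, args=(rank, net, num_envs, 100))
             for rank in range(max_session)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Each process corresponds to one asynchronous (Hogwild!) worker, while the batch dimension of size num_envs stands in for the synchronous vector environments that each session steps in lockstep.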
To train this A3C agent with n-step returns using SLM Lab, run
the commands shown in Code 8.3 in a terminal.

conda activate lab
python run_lab.py slm_lab/spec/benchmark/a3c/a3c_nstep_pong.json a3c_nstep_pong train

Code 8.3: A3C: training an agent

As usual, this will run a training Trial to produce the graphs shown in Figure 8.1. However, note that now the sessions take on the role of asynchronous workers. The trial should take only a few hours to complete when running on CPUs, although it will require a machine with at least 16 CPUs.

Figure 8.1: The trial graphs of A3C (n-step returns) with 16 workers. Since sessions take on the role of workers, the horizontal axis measures the number of frames experienced by an individual worker. The total number of frames experienced collectively is therefore the sum over the individual workers, which adds up to 10 million frames.

8.4 SUMMARY
In this chapter, we discussed two widely applicable

parallelization methods: synchronous and asynchronous. We showed that they can be implemented using vector environments and the Hogwild! algorithm, respectively.
The two benefits of parallelization are faster training and more
diverse data. The second benefit plays a crucial role in
stabilizing and improving the training of policy gradient
algorithms. In fact, it often makes the difference between
success and failure.
When deciding which parallelization method to apply, it helps to consider three factors: ease of implementation, compute cost, and scale.
Synchronous methods (e.g., vector environments) are often more straightforward and easier to implement than asynchronous methods, particularly if only the data gathering is parallelized. Because parallelizing data generation is relatively cheap, synchronous methods also require fewer resources for the same number of frames and scale well up to a moderate number of workers, e.g., fewer than 100. However, the synchronization barrier becomes a bottleneck at larger scales; in that case, asynchronous methods will likely be significantly faster.
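For comparison, the snippet below is a generic sketch of the synchronous approach using Gymnasium's built-in SyncVectorEnv rather than SLM Lab's environment wrappers; the environment name and the random actions are placeholders.

import gymnasium as gym
import numpy as np

num_envs = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("CartPole-v1") for _ in range(num_envs)]
)
obs, info = envs.reset(seed=0)
for _ in range(100):
    # Random actions stand in for a policy. All environments step together,
    # so the agent waits at this synchronization barrier on every iteration.
    actions = np.array([envs.single_action_space.sample() for _ in range(num_envs)])
    obs, rewards, terminations, truncations, infos = envs.step(actions)
envs.close()

The barrier at envs.step() is what limits this approach at a large scale: every iteration is only as fast as the slowest environment.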
It is not always necessary to parallelize. As a general rule, try to
understand if a problem is simple enough to be solved without
parallelization before investing time and resources to
implement it. Additionally, the need to parallelize depends on
the algorithm used. Off-policy algorithms such as DQN can
often achieve very strong performance without parallelization
since the experience replay already provides diverse training
data. Even if training takes a very long time, such agents can still learn successfully without parallelization.
