Categorical DQN (C51)
Overview
C51 introduces a distributional perspective for DQN: instead of learning a single expected return for each action, C51 learns to predict a categorical distribution over returns for each action. Empirically, C51 performs strongly on the Arcade Learning Environment (ALE).
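To make the distributional view concrete, here is a minimal framework-free sketch of how a categorical return distribution can be turned back into a Q-value. The helper names (`make_atoms`, `expected_value`), the action labels, and all numeric values are hypothetical and chosen for illustration; they are not taken from c51_atari.py or c51.py.

```python
# Illustrative sketch only: support range, atom count and probabilities are made up.

def make_atoms(v_min, v_max, n_atoms):
    """Evenly spaced return values that form the fixed support of the distribution."""
    delta_z = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * delta_z for i in range(n_atoms)]


def expected_value(pmf, atoms):
    """Q-value implied by a categorical distribution: sum of probability * return."""
    return sum(p * z for p, z in zip(pmf, atoms))


atoms = make_atoms(v_min=-10.0, v_max=10.0, n_atoms=5)  # [-10, -5, 0, 5, 10]

# One probability mass function (pmf) per action, each summing to 1.
pmfs = {
    "left": [0.1, 0.2, 0.4, 0.2, 0.1],   # symmetric around a return of 0
    "right": [0.0, 0.1, 0.2, 0.3, 0.4],  # mass shifted toward high returns
}

q_values = {action: expected_value(pmf, atoms) for action, pmf in pmfs.items()}
greedy_action = max(q_values, key=q_values.get)  # act greedily on the expected return
```

The agent still acts on a scalar Q-value per action; the difference from DQN is that this scalar is derived from a learned distribution rather than predicted directly.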
Original paper: A Distributional Perspective on Reinforcement Learning (Bellemare et al., 2017) [1]
Implemented Variants
Variants Implemented | Description |
---|---|
c51_atari.py, docs | For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques. |
c51.py, docs | For classic control tasks like CartPole-v1. |
Below are our single-file implementations of C51:
c51_atari.py
The c51_atari.py has the following features:
- For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel Box observation space of shape (210, 160, 3)
- Works with the Discrete action space
Usage
poetry install -E atari
python cleanrl/c51_atari.py --env-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --env-id PongNoFrameskip-v4
Explanation of the logged metrics
Running python cleanrl/c51_atari.py will automatically record various metrics such as losses and Q-value estimates in TensorBoard. Below is the documentation for these metrics:
- charts/episodic_return: episodic return of the game
- charts/SPS: number of steps per second
- losses/loss: the cross entropy loss between the \(t\) step state value distribution and the projected \(t+1\) step state value distribution
- losses/q_values: implemented as (old_pmfs * q_network.atoms).sum(1), which is the sum of the probability of getting returns \(x\) (old_pmfs) multiplied by \(x\) (q_network.atoms), averaged over the sample obtained from the replay buffer; useful for gauging whether under- or overestimation happens
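The losses/loss metric can be illustrated with a small sketch of the categorical projection for a single transition, written in plain Python rather than batched tensors. The function names `project_target` and `cross_entropy`, the `eps` floor inside the logarithm, and all numbers are assumptions made for this example; the actual scripts operate on batches and may differ in detail.

```python
import math


def project_target(next_pmf, atoms, reward, done, gamma):
    """Shift each atom to r + gamma * z, then spread its mass onto the fixed support."""
    v_min, v_max = atoms[0], atoms[-1]
    n_atoms = len(atoms)
    delta_z = (v_max - v_min) / (n_atoms - 1)
    target = [0.0] * n_atoms
    for p, z in zip(next_pmf, atoms):
        tz = reward + (0.0 if done else gamma * z)
        tz = min(max(tz, v_min), v_max)            # clip to the support range
        b = (tz - v_min) / delta_z                 # fractional index on the support
        b = min(max(b, 0.0), float(n_atoms - 1))   # guard against rounding error
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:                         # tz lands exactly on an atom
            target[lower] += p
        else:                                      # split mass between the two neighbours
            target[lower] += p * (upper - b)
            target[upper] += p * (b - lower)
    return target


def cross_entropy(target_pmf, predicted_pmf, eps=1e-5):
    """Cross entropy between the projected target and the predicted distribution."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target_pmf, predicted_pmf))


atoms = [-10.0, -5.0, 0.0, 5.0, 10.0]
next_pmf = [0.0, 0.0, 0.5, 0.5, 0.0]   # distribution of the best next action
target = project_target(next_pmf, atoms, reward=1.0, done=False, gamma=0.5)
predicted = [0.2] * 5                   # uniform prediction for the action taken
loss = cross_entropy(target, predicted)
```

When no clipping occurs, the projection preserves the mean: here the target has expected value 1.0 + 0.5 * 2.5 = 2.25, matching the Bellman target for the expected return.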
Implementation details
c51_atari.py is based on (Bellemare et al., 2017) [1] but presents a few implementation differences:
- (Bellemare et al., 2017) [1] injects stochasticity by doing "on each frame the environment rejects the agent’s selected action with probability \(p = 0.25\)", but c51_atari.py does not do this.
- c51_atari.py uses a self-contained evaluation scheme: c51_atari.py reports the episodic returns obtained throughout training, whereas (Bellemare et al., 2017) [1] is trained with --end-e=0.01 but reported episodic returns using a separate evaluation process with --end-e=0.001 (see "5.2. State-of-the-Art Results" on page 7).
- c51_atari.py rescales the gradient so that the norm of the gradient does not exceed 0.5, as done in PPO (ppo2/model.py#L102-L108).
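The gradient rescaling mentioned in the last item can be sketched as global-norm clipping. This is a plain-Python illustration under the assumption that the rescaling follows the usual global-norm rule (the same rule PyTorch's `torch.nn.utils.clip_grad_norm_` applies); the function name `clip_by_global_norm` and the example gradients are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients jointly so their combined L2 norm is at most max_norm.

    grads is a list of per-parameter gradient lists (a stand-in for tensors).
    Returns the rescaled gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm


# Large gradients (norm 5.0) are shrunk to a norm of roughly 0.5.
big_clipped, big_norm = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=0.5)

# Small gradients (norm below 0.5) pass through unchanged.
small_clipped, small_norm = clip_by_global_norm([[0.1], [0.2]], max_norm=0.5)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its magnitude is bounded.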
Experiment results
PR vwxyzjn/cleanrl#159 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.
Below are the average episodic returns for c51_atari.py.
Environment | c51_atari.py 10M steps | (Bellemare et al., 2017, Figure 14) [1] 50M steps | (Hessel et al., 2017, Figure 5) [3] |
---|---|---|---|
BreakoutNoFrameskip-v4 | 467.00 ± 96.11 | 748 | ~500 at 10M steps, ~600 at 50M steps |
PongNoFrameskip-v4 | 19.32 ± 0.92 | 20.9 | ~20 at 10M steps, ~20 at 50M steps |
BeamRiderNoFrameskip-v4 | 9986.96 ± 1953.30 | 14,074 | ~12000 at 10M steps, ~14000 at 50M steps |
Note that we save computational time by reducing timesteps from 50M to 10M, but our c51_atari.py scores the same or higher than (Mnih et al., 2015) in 10M steps.
Learning curves:
Tracked experiments and game play videos:
c51.py
The c51.py has the following features:
- Works with the Box observation space of low-level features
- Works with the Discrete action space
- Works with envs like CartPole-v1
Usage
python cleanrl/c51.py --env-id CartPole-v1
Explanation of the logged metrics
See related docs for c51_atari.py.
Implementation details
c51.py shares the same implementation details as c51_atari.py, except that c51.py runs with different hyperparameters and a different neural network architecture. Specifically,
- c51.py uses a simpler neural network as follows:

        self.network = nn.Sequential(
            nn.Linear(np.array(env.single_observation_space.shape).prod(), 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            # one logit per (action, atom) pair; n_atoms is the number of atoms in the support
            nn.Linear(84, env.single_action_space.n * n_atoms),
        )

- c51.py runs with different hyperparameters:

        python c51.py --total-timesteps 500000 \
            --learning-rate 2.5e-4 \
            --buffer-size 10000 \
            --gamma 0.99 \
            --target-network-frequency 500 \
            --max-grad-norm 0.5 \
            --batch-size 128 \
            --start-e 1 \
            --end-e 0.05 \
            --exploration-fraction 0.5 \
            --learning-starts 10000 \
            --train-frequency 10
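To show how --start-e, --end-e, --exploration-fraction and --total-timesteps interact, here is a minimal sketch of an epsilon schedule. It assumes epsilon decays linearly over the first exploration-fraction share of training and then stays at its final value; the function name `linear_schedule` and the linear shape are assumptions for this example rather than something stated above.

```python
def linear_schedule(start_e, end_e, duration, t):
    """Linearly anneal epsilon from start_e to end_e over `duration` steps, then hold."""
    slope = (end_e - start_e) / duration
    return max(slope * t + start_e, end_e)


# Values from the hyperparameter list above.
total_timesteps = 500_000
exploration_fraction = 0.5
start_e, end_e = 1.0, 0.05

# Epsilon is annealed over the first half of training (250,000 steps).
duration = int(exploration_fraction * total_timesteps)

eps_start = linear_schedule(start_e, end_e, duration, 0)         # fully random actions
eps_mid = linear_schedule(start_e, end_e, duration, 125_000)     # halfway through the decay
eps_late = linear_schedule(start_e, end_e, duration, 400_000)    # after the decay has finished
```

Under these assumptions, the agent explores with a random action 5% of the time for the entire second half of training.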
Experiment results
PR vwxyzjn/cleanrl#159 tracks our effort to conduct experiments, and the reproduction instructions can be found at vwxyzjn/cleanrl/benchmark/c51.
Below are the average episodic returns for c51.py.
Environment | c51.py |
---|---|
CartPole-v1 | 498.51 ± 1.77 |
Acrobot-v1 | -88.81 ± 8.86 |
MountainCar-v0 | -167.71 ± 26.85 |
Note that C51 has no official benchmark on classic control environments, so we did not include a comparison. That said, our c51.py was able to achieve near-perfect scores in CartPole-v1 and Acrobot-v1; further, it can obtain successful runs in the sparse-reward environment MountainCar-v0.
Learning curves:
Tracked experiments and game play videos:
1. Bellemare, M.G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. ICML.
2. [Proposal] Formal API handling of truncation vs termination. https://github.com/openai/gym/issues/2510
3. Hessel, M., Modayil, J., Hasselt, H.V., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M.G., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI.