❓ Question
Hi, I am working with both PPO and A2C, and PPO is working great for me. I have dialed in the PPO hyperparameters well for my case, but when I use A2C with the same (or as close as I can get) hyperparameters, training falls well short of PPO's results.
I know they are different algorithms, so I have also explored the A2C hyperparameters using guidance from the documentation, but I just can't get close to the same performance. I've come across this paper, which shows that it is possible to go from A2C to PPO (i.e., to make PPO behave like A2C) and the steps needed, but is it possible to go the other way? If so, how?
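For context, my (possibly incomplete) reading of that direction in stable-baselines3 terms is sketched below: PPO configured so that its update matches A2C. The environment and `n_envs` here are placeholders, and the exact values are my assumptions based on the paper and the A2C defaults.

```python
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

n_envs = 32
env = make_vec_env("CartPole-v1", n_envs=n_envs)  # placeholder environment

# PPO set up to mimic A2C, as I understand the mapping:
ppo_as_a2c = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=dict(
        optimizer_class=th.optim.RMSprop,  # A2C uses RMSprop by default
        optimizer_kwargs=dict(alpha=0.99, eps=1e-5, weight_decay=0),
    ),
    n_steps=5,                  # A2C's default rollout length
    batch_size=5 * n_envs,      # one full-batch gradient step per rollout
    n_epochs=1,                 # a single pass over the data, like A2C
    gae_lambda=1.0,             # A2C's default (no GAE smoothing)
    normalize_advantage=False,  # A2C does not normalize advantages by default
    clip_range=10.0,            # large enough that clipping never triggers
    learning_rate=7e-4,         # A2C's default learning rate
)
```

What I am after is essentially the inverse of this mapping.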
My best performance was (PPO in orange, A2C in blue):

With other attempts not being close (PPO in orange, A2C with various hyperparameters as other colours):
This is the network structure I use for both, with MlpPolicy, and the PPO parameters:
```python
import torch as th

policy_kwargs = {
    "activation_fn": th.nn.ReLU,
    "net_arch": {
        "pi": [size, size],  # "size" is set earlier in my script
        "vf": [size, size],
    },
    "ortho_init": True,
}

ppo_params = {
    "n_steps": 6144,
    "batch_size": 512,
    "n_epochs": 6,
    "clip_range": 0.25,
    "ent_coef": 0.01,
    "max_grad_norm": 0.5,
    "gamma": 0.995,
    "gae_lambda": 0.95,
    "target_kl": 0.03,
}
```

I always use 32 envs for training.
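To make the mapping concrete, this is roughly how I would translate those values to A2C (placeholder environment and a placeholder `size`; I am not sure the optimizer and `n_steps` choices are right, which is part of my question):

```python
import torch as th
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

size = 64  # placeholder; the real width is set elsewhere in my script
env = make_vec_env("CartPole-v1", n_envs=32)  # placeholder for my actual env

model = A2C(
    "MlpPolicy",
    env,
    policy_kwargs={
        "activation_fn": th.nn.ReLU,
        "net_arch": {"pi": [size, size], "vf": [size, size]},
        "ortho_init": True,
    },
    use_rms_prop=False,        # fall back to Adam, which PPO uses
    learning_rate=3e-4,        # PPO's default (A2C defaults to 7e-4)
    n_steps=6144,              # same per-env rollout length as my PPO config
    gamma=0.995,
    gae_lambda=0.95,
    ent_coef=0.01,
    max_grad_norm=0.5,
    normalize_advantage=True,  # PPO normalizes advantages by default, A2C does not
)
```

As far as I can tell, `batch_size`, `n_epochs`, `clip_range`, and `target_kl` have no direct A2C equivalents, which is where I get stuck.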
Checklist
- I have checked that there is no similar issue in the repo
- I have read the documentation
- If code there is, it is minimal and working
- If code there is, it is formatted using the markdown code blocks for both code and stack traces.