
[Question] A2C vs PPO parameter/performance matching #2186

@A-Artemis


❓ Question

Hi, I am working with both PPO and A2C, and PPO is working great for me. I have dialed in the PPO hyperparameters well for my case, but when I use A2C with the same (or as close as I can get) hyperparameters, training struggles.

I know they are different algorithms, so I have also explored A2C-specific settings using guidance from the documentation, but I just can't get close to the same performance. I've come across this paper, which shows it is possible to go from A2C to PPO (i.e., make PPO behave like A2C) and the steps needed, but is it possible to go the other way? If so, how?
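For reference, my understanding of that direction (PPO configured to behave like A2C) in Stable-Baselines3 is roughly the sketch below; the values are A2C's defaults, `env` stands in for my 32-env vectorized environment, and the exact optimizer settings are my assumption rather than something I've verified:

    from stable_baselines3 import PPO
    from stable_baselines3.common.sb2_compat.rmsprop_tf_like import RMSpropTFLike

    # PPO set up to mimic A2C: one minibatch, one epoch, clipping effectively
    # disabled, no advantage normalization, TF-style RMSprop (A2C defaults).
    model = PPO(
        "MlpPolicy",
        env,                        # 32-env VecEnv, as described below
        n_steps=5,                  # A2C's default per-env rollout length
        batch_size=5 * 32,          # a single minibatch covering the whole rollout
        n_epochs=1,                 # a single gradient step per rollout
        gae_lambda=1.0,             # A2C default (no GAE smoothing)
        clip_range=100.0,           # large enough that clipping never triggers
        normalize_advantage=False,
        learning_rate=7e-4,
        policy_kwargs=dict(
            optimizer_class=RMSpropTFLike,
            optimizer_kwargs=dict(eps=1e-5),
        ),
    )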

My best performance was (PPO in orange, A2C in blue):
[training-curve screenshot]

Other attempts were not close (PPO in orange, A2C with various hyperparameters as other colours):

[training-curve screenshot]

This is the network structure I use for both (with MlpPolicy), together with the PPO parameters:

    import torch as th

    # `size` (the hidden-layer width) is defined elsewhere in my script
    policy_kwargs = {
        "activation_fn": th.nn.ReLU,
        "net_arch": {
            "pi": [size, size],
            "vf": [size, size],
        },
        "ortho_init": True,
    }

    ppo_params = {
        "n_steps": 6144,
        "batch_size": 512,
        "n_epochs": 6,
        "clip_range": 0.25,
        "ent_coef": 0.01,
        "max_grad_norm": 0.5,
        "gamma": 0.995,
        "gae_lambda": 0.95,
        "target_kl": 0.03,
    }

I always use 32 envs for training.
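For completeness, this is roughly how I map those PPO settings onto A2C: only the parameters the two constructors share carry over, while batch_size, n_epochs, clip_range and target_kl have no A2C counterpart, since A2C performs exactly one gradient step per rollout. The env and `size` below are placeholders, not my actual setup:

    import torch as th
    from stable_baselines3 import A2C
    from stable_baselines3.common.env_util import make_vec_env

    env = make_vec_env("CartPole-v1", n_envs=32)  # placeholder; I use my own env
    size = 64                                     # placeholder hidden-layer width

    policy_kwargs = {
        "activation_fn": th.nn.ReLU,
        "net_arch": {"pi": [size, size], "vf": [size, size]},
        "ortho_init": True,
    }

    # Shared parameters copied from the PPO config above; note that A2C's own
    # defaults differ from PPO's (RMSprop optimizer, lr=7e-4, gae_lambda=1.0,
    # normalize_advantage=False), so these likely need separate tuning.
    a2c_params = {
        "n_steps": 6144,   # per-env rollout length; A2C usually uses far fewer (default 5)
        "gamma": 0.995,
        "gae_lambda": 0.95,
        "ent_coef": 0.01,
        "max_grad_norm": 0.5,
    }

    model = A2C("MlpPolicy", env, policy_kwargs=policy_kwargs, **a2c_params)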
