Scaling delay for pod provisioning with higher job spikes #3276

@ventsislav-georgiev

Controller Version

0.8.2

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Reproduce by changing the min runners of an AutoscalingRunnerSet from 0 to 30, then 100, then 400.

Describe the bug

We are experiencing scaling issues during periods of higher demand. When our CI triggers a large number of jobs, the gha-runner-scale-set has a hard time spinning up pods, and they appear to be stuck in the Pending state.

Here are some screen captures of scaling from 0 to 30, to 100 and to 400 runners:

0 to 30 (took 15s to create 30 pods)

30target_15s_to_30pod.mov

0 to 100 (took 40s to create 30 pods)

100target_40s_to_30pod.mov

0 to 400 (took 2m 8s to create 30 pods)

400target_2m8s_to_30pod.mov
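
To make the slowdown easier to compare, here is a minimal sketch that turns the three timings above into an average pod creation rate. The only inputs are the numbers reported in this issue; the names `measurements`, `creation_rate` and `slowdown` are made up for illustration and are not part of the controller.

```python
# Timings reported above: target runner count -> seconds until 30 pods existed.
measurements = {30: 15, 100: 40, 400: 2 * 60 + 8}

PODS_CREATED = 30  # every recording was timed up to the 30th pod


def creation_rate(seconds: float, pods: int = PODS_CREATED) -> float:
    """Average number of pods created per second over the measured window."""
    return pods / seconds


def slowdown(seconds: float, baseline_seconds: float) -> float:
    """How many times longer this run took than the baseline run."""
    return seconds / baseline_seconds


baseline = measurements[30]
for target, seconds in measurements.items():
    print(
        f"target={target:>3}  time={seconds:>3}s  "
        f"rate={creation_rate(seconds):.2f} pods/s  "
        f"relative_time={slowdown(seconds, baseline):.1f}x"
    )
```

With these numbers the rate to the first 30 pods drops from about 2 pods/s at a target of 30 to well under 1 pod/s at a target of 400, even though the same 30 pods are being created in each case.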

Describe the expected behavior

The time to create the first pods should be the same regardless of the target runner count. Otherwise CI slows down at the moment it is needed most.

Additional Context

-

Controller Logs

https://gist.github.com/ventsislav-georgiev/f318f84b6bc6e801d733907087ce287c

Runner Pod Logs

[Irrelevant]

Labels

  • bug (Something isn't working)
  • gha-runner-scale-set (Related to the gha-runner-scale-set mode)
