
Conversation

@omerap12 (Member) commented Nov 4, 2025

  • One-line PR description: Fallback for HPA on failure to retrieve metrics
  • Other comments:

Signed-off-by: Omer Aplatony <omerap12@gmail.com>
@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Nov 4, 2025
@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Nov 4, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: omerap12
Once this PR has been reviewed and has the lgtm label, please assign jackfrancis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 4, 2025
@omerap12 omerap12 changed the title KEP-5679: Fallback for HPA on failure to retrieve metrics [WIP] KEP-5679: Fallback for HPA on failure to retrieve metrics Nov 5, 2025
// consecutiveFailureCount tracks the number of consecutive failures retrieving this metric.
// Reset to 0 on successful retrieval.
// +optional
ConsecutiveFailureCount int32 `json:"consecutiveFailureCount,omitempty"`
Contributor:

Changing how frequently external metrics are retrieved, and variations between Kubernetes providers, will lead to different behavior.

Would it make sense to use a FailureDuration instead?

@omerap12 (Member Author):

I like that idea; a duration-based threshold would be more consistent.
We would just track the timestamp of the first failure and activate the fallback once the duration threshold is exceeded.
@adrianmoisey , thoughts?

Member:

Yeah, duration makes sense. I think we just need to make the description clear that the duration is measured from the first of the consecutive failures.

averageValue: "30"
fallback:
  failureThreshold: 3
  averageValue: "100" # Assume high queue depth, scale up
Contributor:

If users specify a fallback.averageValue different from target.averageValue, they'll never stop scaling up (or down).

I suspect that's not what you meant to write here: you probably want to specify the default external metric value, not the average value, and this is in accordance with l.301 below that specifies a value (not averageValue) field.

@omerap12 (Member Author):

I think it might actually be the opposite (please correct me if I'm wrong).
If users set a fallback.averageValue that's different from target.averageValue, the pods will stop scaling after the fallback kicks in:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go#L405

But with Value, they'll never stop scaling, since the calculation multiplies by the current number of ready pods (which is always > 1):
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go#L294

So yeah, we probably need to handle these two cases a bit differently. One possible solution, similar to what KEDA does, is to fall back to a static number of pods. That would address both cases.
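A simplified sketch of the two calculations being discussed (mirroring the shape of the replica_calculator.go code linked above, not reproducing it) shows why a fixed fallback averageValue converges while a fixed fallback Value does not:

```go
package main

import (
	"fmt"
	"math"
)

// replicasForAverageValue: desired = ceil(metricValue / targetAverageValue).
// Independent of the current replica count, so a fixed fallback value
// converges to a stable size.
func replicasForAverageValue(metricValue, targetAverageValue float64) int {
	return int(math.Ceil(metricValue / targetAverageValue))
}

// replicasForValue: desired = ceil((metricValue / targetValue) * readyPods).
// Proportional to ready pods, so a fixed fallback value above target keeps
// multiplying the current size: unbounded scale-up.
func replicasForValue(metricValue, targetValue float64, readyPods int) int {
	return int(math.Ceil(metricValue / targetValue * float64(readyPods)))
}

func main() {
	// averageValue fallback of 100 with target 30: stable at ceil(100/30) = 4.
	fmt.Println(replicasForAverageValue(100, 30))

	// Value fallback of 100 with target 30: each reconciliation multiplies
	// the pod count by ~3.34, so it grows 1 -> 4 -> 14 -> 47 -> ...
	pods := 1
	for i := 0; i < 3; i++ {
		pods = replicasForValue(100, 30, pods)
	}
	fmt.Println(pods)
}
```

The key difference is the readyPods factor: with averageValue the desired count is a pure function of the fallback value, while with Value it feeds back on itself.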

Contributor:

Agreed, I think we're saying the same thing; I'm just not putting it as clearly as you are :) I suspect you want the fallback to specify 'the external metric value to use when there's a problem retrieving it' (i.e. the value of the usage variable here).

You could specify a fallback static number of pods, but then if 2 failing metrics both specify a fallback, which static number do you use? At first glance it doesn't look as elegant, but perhaps there's a reasonable solution?

@omerap12 (Member Author):

We can follow the existing HPA logic by taking the maximum of the two values.
The issue with changing the usage calculation is that it’s multiplied by the current number of ready pods, which could lead to unbounded scaling. Using a fixed number of pods seems like the correct approach to me, but I’m open to better suggestions.
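The "maximum of the two values" behavior mentioned above can be sketched as follows: each metric yields a replica proposal, the controller takes the maximum, and a static fallback replica count would simply be one more proposal in that max (so two failing metrics with different fallbacks resolve unambiguously).

```go
package main

import "fmt"

// maxDesiredReplicas mirrors the existing HPA behavior referenced above:
// each metric produces a replica proposal and the controller takes the
// maximum across all of them. A static fallback replica count (the
// KEDA-style approach discussed here) contributes one more proposal.
func maxDesiredReplicas(proposals []int) int {
	desired := 0
	for _, p := range proposals {
		if p > desired {
			desired = p
		}
	}
	return desired
}

func main() {
	// Two failing metrics with static fallbacks of 5 and 8 pods,
	// plus one healthy metric currently proposing 3 pods.
	fmt.Println(maxDesiredReplicas([]int{5, 8, 3})) // 8
}
```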

