[WIP] KEP-5679: Fallback for HPA on failure to retrieve metrics #5680
base: master
Conversation
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: omerap12. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
// consecutiveFailureCount tracks the number of consecutive failures retrieving this metric.
// Reset to 0 on successful retrieval.
// +optional
ConsecutiveFailureCount int32 `json:"consecutiveFailureCount,omitempty"`
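As a rough illustration of the semantics documented on this field, a sketch of how the counter would be maintained (the helper below is hypothetical, not part of the proposed API):

```go
package fallback

// trackFailure is an illustrative helper, not proposed API: the counter is
// incremented on a failed retrieval, reset to 0 on success, and the fallback
// activates once it reaches the configured failureThreshold.
func trackFailure(count int32, retrievalFailed bool, failureThreshold int32) (newCount int32, useFallback bool) {
	if retrievalFailed {
		count++
	} else {
		count = 0
	}
	return count, count >= failureThreshold
}
```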
Changing how frequently external metrics are retrieved, and variations between Kubernetes providers, will lead to different behavior.
Would it make sense to use a FailureDuration instead?
I like that idea; a duration-based threshold would be more consistent.
We should just track the timestamp of the first failure and activate the fallback once the duration threshold is exceeded.
@adrianmoisey, thoughts?
Yeah, duration makes sense. I think we just need to make the description clear that the duration is measured since the first of the consecutive failures.
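A minimal sketch of the duration-based variant being discussed, assuming the status records the timestamp of the first failure in the current consecutive streak (field and helper names are illustrative, nothing here is settled API):

```go
package fallback

import "time"

// onRetrievalResult maintains the hypothetical first-failure timestamp:
// set it on the first failure of a streak, keep it on subsequent failures,
// and clear it as soon as a retrieval succeeds.
func onRetrievalResult(firstFailure *time.Time, now time.Time, failed bool) *time.Time {
	switch {
	case !failed:
		return nil
	case firstFailure == nil:
		return &now
	default:
		return firstFailure
	}
}

// shouldFallback activates the fallback once the configured duration has
// elapsed since the first failure of the current streak.
func shouldFallback(firstFailure *time.Time, now time.Time, failureDuration time.Duration) bool {
	return firstFailure != nil && now.Sub(*firstFailure) >= failureDuration
}
```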
averageValue: "30"
fallback:
  failureThreshold: 3
  averageValue: "100" # Assume high queue depth, scale up
If users specify a fallback.averageValue different from target.averageValue, they'll never stop scaling up (or down).
I suspect that's not what you meant to write here: you probably want to specify the default external metric value, not the average value, in accordance with l.301 below, which specifies a value (not averageValue) field.
I think it might actually be the opposite (please correct me if I’m wrong).
If users set a fallback.averageValue that’s different from target.averageValue, the pods will stop scaling after the fallback kicks in:
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go#L405
But with Value, they’ll never stop scaling, since the calculation multiplies by the current number of ready pods (which is always > 1):
https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/replica_calculator.go#L294
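To make the difference concrete, here is a rough sketch of the two formulas linked above, simplified from replica_calculator.go (tolerance and stabilization are omitted; the numbers in the comments are illustrative):

```go
package fallback

import "math"

// AverageValue target: the desired count does not depend on how many pods are
// currently ready, so a fixed fallback usage converges, e.g. usage=100 with
// averageValue=30 always yields ceil(100/30) = 4 replicas.
func replicasForAverageValue(usage, targetAverageValue float64) int32 {
	return int32(math.Ceil(usage / targetAverageValue))
}

// Value target: the usage ratio is multiplied by the current number of ready
// pods, so a fixed fallback usage above the target compounds, e.g. usage=100
// with value=30 and 10 ready pods yields 34, then 114 on the next iteration.
func replicasForValue(usage, targetValue float64, readyPods int32) int32 {
	return int32(math.Ceil(usage / targetValue * float64(readyPods)))
}
```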
So yeah, we probably need to handle these two cases a bit differently. One possible solution, similar to what KEDA does, is to fall back to a static number of pods. This would address both cases.
Agreed, I think we're saying the same thing, I'm just not as clear as you are :) I suspect you want the fallback to specify 'the external metric value to use when there's a problem retrieving it' (i.e. the value of the usage variable here).
You could specify a fallback static number of pods, but then if 2 failing metrics both specify a fallback, which static number do you use? At first glance it doesn't look as elegant, but perhaps there's a reasonable solution?
We can follow the existing HPA logic by taking the maximum of the two values.
The issue with changing the usage calculation is that it’s multiplied by the current number of ready pods, which could lead to unbounded scaling. Using a fixed number of pods seems like the correct approach to me, but I’m open to better suggestions.
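If the proposal went the static-replica-count route, resolving the two-failing-metrics question above could look something like this sketch (the helper and its inputs are hypothetical): take the maximum of the per-metric fallback counts, consistent with how the HPA already picks the highest proposal across metrics.

```go
package fallback

// resolveFallbackReplicas is an illustrative helper: given the static fallback
// replica counts of every metric that is currently failing, pick the highest
// one and report whether any fallback applied. The result would still be
// clamped by minReplicas/maxReplicas as usual.
func resolveFallbackReplicas(fallbackReplicas []int32) (int32, bool) {
	var desired int32
	for _, r := range fallbackReplicas {
		if r > desired {
			desired = r
		}
	}
	return desired, len(fallbackReplicas) > 0
}
```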