From f4368989f21b8ddbc54208b8e85cb622d90fbd75 Mon Sep 17 00:00:00 2001
From: Omer Aplatony
Date: Tue, 4 Nov 2025 14:37:24 +0000
Subject: [PATCH 01/12] KEP-5679: Fallback for HPA on failure to retrieve metrics

Signed-off-by: Omer Aplatony
---
 .../5679-external-metric-fallback/README.md | 987 ++++++++++++++++++
 .../5679-external-metric-fallback/kep.yaml  |  40 +
 2 files changed, 1027 insertions(+)
 create mode 100644 keps/sig-autoscaling/5679-external-metric-fallback/README.md
 create mode 100644 keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml

diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/README.md b/keps/sig-autoscaling/5679-external-metric-fallback/README.md
new file mode 100644
index 00000000000..21202523848
--- /dev/null
+++ b/keps/sig-autoscaling/5679-external-metric-fallback/README.md
@@ -0,0 +1,987 @@
+
# KEP-5679: Fallback for HPA External Metrics on Retrieval Failure

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
      - [Prerequisite testing updates](#prerequisite-testing-updates)
      - [Unit tests](#unit-tests)
      - [Integration tests](#integration-tests)
      - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +## Summary + +The Horizontal Pod Autoscaler's reliance on external metrics creates a dependency on systems outside the Kubernetes cluster's control. These external systems (cloud provider APIs, third-party monitoring systems, message brokers, etc.) may experience: + +- Network connectivity issues +- Rate limiting +- Service outages +- Authentication/authorization failures +- Degraded performance + +When external metrics become unavailable, the HPA cannot make informed scaling decisions, which can lead to: +- Workloads stuck at insufficient scale during traffic spikes +- Inability to respond to critical business metrics (e.g., queue depth, error rates) +- Over-dependence on external system reliability + +Unlike in-cluster resource metrics (CPU, memory) served by metrics-server, which are part of the cluster's core infrastructure, external metrics are inherently less reliable and outside the cluster operator's direct control. + +## Motivation + +The Horizontal Pod Autoscaler (HPA) is a critical component for scaling Kubernetes +workloads based on resource utilization or custom metrics. However, the current +implementation depends entirely on the availability of the resource metrics API +or custom metrics API to make scaling decisions. If these APIs experience +downtime or degradation, the HPA cannot take any scaling actions, leaving +workloads potentially overprovisioned, underprovisioned, or entirely unmanaged. + +In contrast, other autoscalers like [KEDA](https://keda.sh/) already provide mechanisms to define +fallback strategies in the event of metric retrieval failures. These strategies +mitigate the impact of API unavailability, enabling the autoscaler to maintain +a functional scaling strategy even when metrics are temporarily inaccessible. + +By allowing users to configure fallback behavior in HPA, this proposal aims to +reduce the criticality of the metrics APIs and improve the overall robustness +of the autoscaling system. 
This change allows users to define safe scaling
actions, such as scaling to a predefined maximum or holding the current scale
(the current behavior), ensuring workloads remain operational and better aligned
with user-defined requirements during unexpected disruptions.

Additionally, the community has expressed interest in addressing this
limitation in the past ([#109214](https://github.com/kubernetes/kubernetes/issues/109214)).

### Goals

- Allow users to optionally define fallback values for external metrics when retrieval fails
- Provide per-metric failure tracking and fallback behavior
- Maintain the HPA's scaling algorithm and respect min/max replica constraints
- Ensure users can determine which specific metrics are using fallback values

### Non-Goals

- Fallback for resource metrics (CPU, memory from metrics-server) - these are in-cluster and should be addressed at the infrastructure level if unavailable
- Fallback for pods/object metrics - these use in-cluster APIs
- Fallback for custom metrics - may be considered in the future based on alpha feedback
- Last-known-good metric value caching
- Automatic fallback value calculation
- Changing the HPA scaling algorithm

## Proposal

Add optional fallback configuration to the [ExternalMetricSource](https://github.com/kubernetes/kubernetes/blob/48c56e04e0bc2cdc33eb67ee36ca69eba96b5d0b/staging/src/k8s.io/api/autoscaling/v2/types.go#L343) type, allowing users to specify:

1. A failure threshold (number of consecutive failures before activating fallback)
2. A substitute metric value to use when the threshold is exceeded

This approach:
- **Maintains the HPA algorithm**: Fallback provides a metric value, not a fixed replica count
- **Is per-metric**: Each external metric can have its own fallback configuration
- **Provides visibility**: Status shows which metrics are in fallback state
- **Is conservative**: Only applies to external metrics, which are inherently out-of-cluster

### User Stories

#### Story 1: SaaS Application Scaling on Queue Depth

I run a SaaS application that scales based on a cloud provider's message queue depth (external metric). Occasionally, the cloud provider's metrics API experiences brief outages (5-10 minutes). During these outages, my HPA cannot scale, and customer requests queue up.
+ +With this feature, I can configure: +```yaml +metrics: +- type: External + external: + metric: + name: queue_depth + target: + type: AverageValue + averageValue: "30" + fallback: + failureThreshold: 3 + averageValue: "100" # Assume high queue depth, scale up +``` + +When the external API fails, the HPA treats the queue depth as 100, triggering scale-up to handle the presumed backlog safely. + +#### Story 2: E-commerce Site with Multiple External Metrics + +My e-commerce site scales on both external error rates and external request latency from a third-party monitoring system. I want different fallback strategies: + +```yaml +metrics: +- type: External + external: + metric: + name: error_rate + target: + type: Value + value: "0.01" # 1% error rate + fallback: + failureThreshold: 3 + value: "0.05" # Assume higher errors, scale up +- type: External + external: + metric: + name: p99_latency_ms + target: + type: Value + value: "200" + fallback: + failureThreshold: 3 + value: "500" # Assume high latency, scale up +``` + +If only one metric fails, the HPA continues using the healthy metric while falling back for the failed one. + +### Risks and Mitigations + +- Risk: Users configure inappropriate fallback values + - Mitigation: Documentation with best practices; validation ensures values are positive; HPA min/max constraints still apply +- Risk: Complexity in understanding which metric is in fallback + - Mitigation: Per-metric status clearly shows fallback state and failure count + +## Design Details + +Add a new `ExternalMetricFallback` type and include it in `ExternalMetricSource`: + +```golang +// ExternalMetricFallback defines fallback behavior when an external metric cannot be retrieved +type ExternalMetricFallback struct { + // failureThreshold is the number of consecutive failures retrieving this metric + // before the fallback value is used. Must be greater than 0. + // +optional + // +kubebuilder:default=3 + FailureThreshold *int32 `json:"failureThreshold,omitempty"` + + // value is the fallback metric value to use when the external metric cannot be retrieved. + // Exactly one of value or averageValue must be set, matching the target type. + // +optional + Value *resource.Quantity `json:"value,omitempty"` + + // averageValue is the fallback metric value per pod to use when the external metric cannot be retrieved. + // Exactly one of value or averageValue must be set, matching the target type. + // +optional + AverageValue *resource.Quantity `json:"averageValue,omitempty"` +} + +// ExternalMetricSource indicates how to scale on a metric not associated with +// any Kubernetes object (for example length of queue in cloud +// messaging service, or QPS from loadbalancer running outside of cluster). +type ExternalMetricSource struct { + // metric identifies the target metric by name and selector + Metric MetricIdentifier `json:"metric" protobuf:"bytes,1,name=metric"` + + // target specifies the target value for the given metric + Target MetricTarget `json:"target" protobuf:"bytes,2,name=target"` + + // fallback defines the behavior when this external metric cannot be retrieved. + // If not set, the HPA will not scale based on this metric when it's unavailable. + // +optional + Fallback *ExternalMetricFallback `json:"fallback,omitempty"` +} +``` + +Update `MetricStatus` to include per-metric fallback information: + +```golang +// ExternalMetricStatus indicates the current value of a global metric not associated +// with any Kubernetes object. 
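// Note: the fallback-related status fields added below are new in this proposal; the
// controller only maintains them for external metrics that have a fallback configured,
// and only while the HPAExternalMetricFallback feature gate is enabled.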
+type ExternalMetricStatus struct { + // metric identifies the target metric by name and selector + Metric MetricIdentifier `json:"metric" protobuf:"bytes,1,name=metric"` + + // current contains the current value for the given metric + Current MetricValueStatus `json:"current" protobuf:"bytes,2,name=current"` + + // fallbackActive indicates whether this metric is currently using a fallback value + // due to retrieval failures. + // +optional + FallbackActive bool `json:"fallbackActive,omitempty"` + + // consecutiveFailureCount tracks the number of consecutive failures retrieving this metric. + // Reset to 0 on successful retrieval. + // +optional + ConsecutiveFailureCount int32 `json:"consecutiveFailureCount,omitempty"` +} +``` + +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +None required. + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +We will add the following e2e autoscaling tests: + +- External metric failure triggers fallback after threshold is reached +- Success in retrieving external metric resets the failure count +- HPA continues using other healthy metrics while one is in fallback +- Fallback respects HPA min/max replica constraints +- Status correctly reflects which metrics are in fallback state + +### Graduation Criteria + + + +#### Alpha + +- Feature implemented behind `HPAExternalMetricFallback` feature gate +- Unit and e2e tests passed as designed in [TestPlan](#test-plan). + +#### Beta + +- Unit and e2e tests passed as designed in [TestPlan](#test-plan). +- Gather feedback from developers and surveys +- All functionality completed +- All security enforcement completed +- All monitoring requirements completed +- All testing requirements completed +- All known pre-release issues and gaps resolved + +#### GA + +- No negative feedback. +- All issues and gaps identified as feedback during beta are resolved + +### Upgrade / Downgrade Strategy + +#### Upgrade + +When the feature gate is enabled: +- Existing HPAs continue to work unchanged +- External metrics without `fallback` configuration behave as they do today (no scaling when unavailable) +- Users can add `fallback` configuration to external metrics in their HPAs +- The controller begins tracking per-metric `consecutiveFailureCount` for external metrics with fallback configured, starting from 0 +- The `fallbackActive` and `consecutiveFailureCount` status fields are populated for external metrics + +#### Downgrade + +When the feature gate is disabled: +- The `fallback` field in `ExternalMetricSource` is ignored by the controller +- The `fallbackActive` and `consecutiveFailureCount` status fields are not updated (remain at last values but are not used) +- All external metrics revert to current behavior: HPA cannot scale based on them when they're unavailable +- Any HPAs currently using fallback values will: + - Maintain their current replica count + - Stop using fallback values + - Resume normal metric-based scaling when external metrics become available again +- No disruption to running workloads (pods are not restarted) + +All logic related to fallback evaluation, failure counting, and status updates is gated by the `HPAExternalMetricFallback` feature gate. + + + +### Version Skew Strategy + + + +1. 
`kube-apiserver`: More recent instances will accept and validate the new `fallback` field in `ExternalMetricSource`, While older instances will ignore it during validation and persist it as part of the HPA object. +2. `kube-controller-manager`: An older version could receive an HPA containing the new `fallback` field from a more recent API server, in which case it would ignore the field (i.e., continue with current behavior where external metrics that fail to retrieve prevent scaling) + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: HPAExternalMetricFallback + - Components depending on the feature gate: `kube-controller-manager` and `kube-apiserver` + +###### Does enabling the feature change any default behavior? + +No. By default, HPAs will continue to behave as they do today. The feature only activates when users explicitly configure the `fallback` field on external metrics in their HPA specifications. +External metrics without fallback configuration will continue to prevent scaling when unavailable, which is the current behavior. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. If the feature gate is disabled: +- All `fallback` configurations in HPA specs are ignored by the controller +- External metrics revert to current behavior: HPA cannot scale based on them when they're unavailable +- The `fallbackActive` and `consecutiveFailureCount` status fields stop being updated +- HPAs maintain their current replica count at the time of rollback +- No pods are restarted or disrupted + +To disable, restart `kube-controller-manager` and `kube-apiserver` with the feature gate set to `false`. + +###### What happens if we reenable the feature if it was previously rolled back? + +When the feature is re-enabled: +- Any HPAs with `fallback` configured on external metrics will resume fallback behavior +- The controller immediately begins tracking `consecutiveFailureCount` for each external metric with fallback, starting from 0 +- If external metrics are failing at re-enablement: + - Failure counts increment on each reconciliation loop + - Once the configured `failureThreshold` is reached, fallback values are used + - The `fallbackActive` status field is set to `true` for affected metrics +- HPAs resume using substitute metric values for scaling decisions when external metrics are unavailable and thresholds are exceeded + +Existing HPAs without `fallback` configuration are not affected by re-enabling the feature and continue with default behavior. + +###### Are there any tests for feature enablement/disablement? + +Yes. Unit tests will verify that HPAs with and without the `fallback` field are properly validated both when the feature gate is enabled or disabled, and that the HPA controller correctly applies fallback behavior based on the feature gate status. + + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + +Rollout failures are unlikely to impact running workloads. If enabled during external metrics failures, HPAs with fallback configured might change scaling decisions after the failureThreshold is reached. 
This is mitigated by: +- The HPA's min/max replica constraints +- The failure threshold requirement (default: 3) before activation +- Gradual HPA scaling behavior +On rollback, HPAs maintain their current replica count and stop using fallback values. No pods are restarted. + +###### What specific metrics should inform a rollback? + + +- Unexpected scaling events after enabling the feature +- Increased error rate in horizontal_pod_autoscaler_controller_metric_computation_total +- High percentage of HPAs showing fallbackActive: true unexpectedly +- Increased latency in horizontal_pod_autoscaler_controller_reconciliation_duration_seconds + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + +No. This feature only adds a new optional field to the HPA API and doesn't deprecate or remove any existing functionality. All current HPA behaviors remain unchanged unless users explicitly opt into the fallback mode. + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + +The presence of the `fallback` field in `ExternalMetricSource` specifications indicates that the feature is in use. + +###### How can someone using this feature know that it is working for their instance? + + +Users can confirm that the feature is active and functioning by inspecting the status fields exposed by the controller. Specifically: +- Check `.status.currentMetrics[].external.fallbackActive` to verify if fallback is currently active +- Check `.status.currentMetrics[].external.consecutiveFailureCount` to see the current failure count + +Moreover, users can verify the feature is working properly through events on the HPA object: +- When fallback activates: Normal `ExternalMetricFallbackActivated` "Fallback activated for external metric 'queue_depth' after 3 consecutive failures" + + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + +This feature utilizes the existing HPA controller metrics: +- `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` +- `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` +- `horizontal_pod_autoscaler_controller_metric_computation_total` + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + +This feature doesn't fundamentally change how the HPA controller operates; it adds fallback handling when external metrics fail to be retrieved. Therefore, existing metrics for monitoring HPA controller health remain applicable: +- `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` - monitors overall HPA reconciliation performance +- `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` - tracks metric computation time including fallback evaluation +- `horizontal_pod_autoscaler_controller_metric_computation_total` - counts metric computations with error status + +Additionally, the new feature-specific metrics provide health indicators: +- `horizontal_pod_autoscaler_fallback_active` - indicates which external metrics are currently using fallback values +- `horizontal_pod_autoscaler_external_metric_retrieval_failures_total` - monitors external metric retrieval failures + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? 
+ + +The following new metrics will be added to the kube-controller-manager to improve observability of the fallback feature: +- `horizontal_pod_autoscaler_fallback_active` (Gauge) - Indicates whether a specific external metric is currently using its fallback value. +- `horizontal_pod_autoscaler_external_metric_retrieval_failures_total` (Count) - Counts the number of consecutive failures when retrieving an external metric (resets to 0 on success). + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + +No. The feature only adds logic to the existing HPA reconciliation loop. It doesn't introduce new API calls. +The feature tracks failure counts and applies fallback logic in-memory during existing reconciliation cycles. + +###### Will enabling / using this feature result in introducing new API types? + + +No. The feature only adds new fields to existing API types: +- New `ExternalMetricFallback` struct within `ExternalMetricSource` +- New status fields in `ExternalMetricStatus` + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + +Yes, HorizontalPodAutoscaler objects will increase in size when fallback is configured: + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + +Yes, `HorizontalPodAutoscaler` objects will increase in size when fallback is configured: +- Spec increase: ~150 bytes per external metric with fallback configured: + - failureThreshold: ~30 bytes + - value or averageValue: ~50 bytes (resource.Quantity) +- Status increase: ~80 bytes per external metric: + - fallbackActive: ~30 bytes (boolean field) + - consecutiveFailureCount: ~50 bytes (int32 field) + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + +No. Memory increase in kube-controller-manager is ~100 bytes per HPA for failure count tracking. For 1000 HPAs with 2 external metrics each: ~200 KB total, which is negligible. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + +No. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +If the API server and/or etcd becomes unavailable, the entire HPA controller functionality will be impacted, not just this feature. The HPA controller will not be able to: +- Retrieve HPA objects +- Get external metrics (or any metrics) +- Update HPA status (including `fallbackActive` and `consecutiveFailureCount` fields) +- Apply scaling decisions + +Therefore, no autoscaling decisions can be made during this period, regardless of whether fallback is configured. The feature itself doesn't introduce any new failure modes with respect to API server or etcd availability - it's dependent on these components being available just like the rest of the HPA controller's functionality. + +Once API server and etcd access is restored, the HPA controller will resume normal operation. The in-memory failure counts will reset, and if external metrics are still failing, the failure count will increment again until reaching the threshold to reactivate fallback. 
+ +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + +- Check if issue affects only HPAs with fallback configured also check `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` and `horizontal_pod_autoscaler_fallback_active_total` metrics +- Review HPA events: kubectl describe hpa +- Verify external metrics provider health +- Temporarily remove fallback configuration from affected HPAs if needed + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml b/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml new file mode 100644 index 00000000000..107d62c685b --- /dev/null +++ b/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml @@ -0,0 +1,40 @@ +title: Fallback for HPA on failure to retrieve metrics +kep-number: 5679 +authors: + - "@omerap12" + - "@adrianmoisey" +owning-sig: sig-autoscaling +status: provisional +creation-date: 2025-04-11 +reviewers: + - TBD +approvers: + - TBD + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: TBD + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.36" + beta: "v1.37" + stable: "v1.38" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: HPAExternalMetricFallback + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - horizontal_pod_autoscaler_fallback_active + - horizontal_pod_autoscaler_external_metric_retrieval_failures_total From e408be134ee0edb4e46ada0c190c9f55866468fb Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Wed, 5 Nov 2025 12:29:37 +0000 Subject: [PATCH 02/12] Add tests section Signed-off-by: Omer Aplatony --- .../5679-external-metric-fallback/README.md | 21 +++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/README.md b/keps/sig-autoscaling/5679-external-metric-fallback/README.md index 21202523848..ee278243ccb 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/README.md +++ b/keps/sig-autoscaling/5679-external-metric-fallback/README.md @@ -384,7 +384,20 @@ This can inform certain test coverage improvements that we want to do before extending the production code to implement this enhancement. 
--> -- ``: `` - `` +- Tests for Fallback Configuration: + - Verify failureThreshold validation (must be > 0) + - Verify validation is skipped when feature gate is disabled +- Tests for Failure Tracking and Activation: + - Verify `consecutiveFailureCount` increments on failure and resets on success + - Verify fallback activates when threshold is reached + - Verify fallbackActive status field updates correctly +- Tests for Replica Calculation: + - Verify `GetExternalMetricReplicas` and `GetExternalPerPodMetricReplicas` functions use fallback values when conditions are met + - Verify replica calculations respect min/max constraints with fallback values + - Verify correct behavior with multiple external metrics (independent failure tracking) + +- `/pkg/controller/podautoscaler`: 05 Nov 2025 - 89.1% +- `/pkg/controller/podautoscaler/metrics`: 05 Nov 2025 - 89.9% ##### Integration tests @@ -403,7 +416,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for https://storage.googleapis.com/k8s-triage/index.html --> -- : +N/A, the feature is tested using unit tests and e2e tests. ##### e2e tests @@ -959,10 +972,6 @@ Major milestones might include: - the version of Kubernetes where the KEP graduated to general availability - when the KEP was retired or superseded --> -- Check if issue affects only HPAs with fallback configured also check `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` and `horizontal_pod_autoscaler_fallback_active_total` metrics -- Review HPA events: kubectl describe hpa -- Verify external metrics provider health -- Temporarily remove fallback configuration from affected HPAs if needed ## Drawbacks From f439cde531e37141c4b04090fbf929435723cebd Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Wed, 5 Nov 2025 12:31:46 +0000 Subject: [PATCH 03/12] Add milestone Signed-off-by: Omer Aplatony --- keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml b/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml index 107d62c685b..38ef0874057 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml +++ b/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml @@ -17,7 +17,7 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: TBD +latest-milestone: "v1.36" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: From 1fb40d67f24c92c74913834c269f9c57bc7774b2 Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Wed, 5 Nov 2025 13:21:06 +0000 Subject: [PATCH 04/12] Add SLO section Signed-off-by: Omer Aplatony --- .../5679-external-metric-fallback/README.md | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/README.md b/keps/sig-autoscaling/5679-external-metric-fallback/README.md index ee278243ccb..8695e634ab1 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/README.md +++ b/keps/sig-autoscaling/5679-external-metric-fallback/README.md @@ -104,9 +104,9 @@ checklist items _must_ be updated for the enhancement to be released. Items marked with (R) are required *prior to targeting to a milestone / release*. 
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) - [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented +- [x] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) @@ -960,6 +960,19 @@ For each of them, fill in the following information by copying the below templat ###### What steps should be taken if SLOs are not being met to determine the problem? +Check `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` and `horizontal_pod_autoscaler_fallback_active` to identify if issues correlate with HPAs using fallback. If problems are observed: + +- Check if the issue only affects HPAs with fallback configured +- Verify if multiple external metrics are failing simultaneously (check `horizontal_pod_autoscaler_external_metric_retrieval_failures_total`) +- Review HPA events: kubectl describe hpa to see fallback activation events +- Check external metrics provider health and connectivity + +For problematic HPAs, you can: + +- Temporarily remove the fallback field to revert to default behavior (HPA holds current scale on metric failure) +- Adjust `failureThreshold` to prevent premature fallback activation +- Review and adjust fallback values if scaling behavior is inappropriate + ## Implementation History Users can confirm that the feature is active and functioning by inspecting the status fields exposed by the controller. Specifically: +- Check the HPA condition to verify if `ExternalMetricFallbackActive` is currently active - Check `.status.currentMetrics[].external.fallbackActive` to verify if fallback is currently active - Check `.status.currentMetrics[].external.consecutiveFailureCount` to see the current failure count From 2b53881ab1ccd79813e9d0bef57bcec515e21e5e Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Sat, 8 Nov 2025 19:41:05 +0000 Subject: [PATCH 09/12] Moved to duraion based configuration Signed-off-by: Omer Aplatony --- .../5679-external-metric-fallback/README.md | 142 ++++++++++-------- 1 file changed, 76 insertions(+), 66 deletions(-) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/README.md b/keps/sig-autoscaling/5679-external-metric-fallback/README.md index 919242ce253..da7c85e5ee6 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/README.md +++ b/keps/sig-autoscaling/5679-external-metric-fallback/README.md @@ -170,6 +170,20 @@ Other autoscalers in the ecosystem, such as [KEDA](https://keda.sh/), already pr - Enable users to define safe, conservative scaling actions when external metrics are temporarily unavailable - Maintain workload availability and performance during external metrics provider disruptions +**Why Duration-Based Instead of Count-Based:** + +Different Kubernetes providers and configurations may poll external metrics at different frequencies. 
The HPA reconciliation loop typically runs every 15 seconds by default (configurable via `--horizontal-pod-autoscaler-sync-period`), but this can vary between clusters. A count-based threshold (e.g., "3 failures") would result in inconsistent behavior: +- In a cluster polling every 15s: 3 failures = 45 seconds +- In a cluster polling every 30s: 3 failures = 90 seconds +- If polling frequency changes, behavior changes unexpectedly + +A duration-based threshold provides consistent, predictable behavior regardless of: +- HPA controller reconciliation frequency +- Kubernetes provider configurations +- Cluster-specific settings + +The duration is measured from the first consecutive failure, ensuring consistent and understandable semantics: "activate fallback if the metric has been failing for at least X minutes." + This enhancement allows users to specify a desired replica count that the HPA should use after a configurable number of consecutive failures to retrieve an external metric. The fallback replica count is treated as the desired replica count from that metric and combined with other metrics using the HPA's standard multi-metric algorithm (taking the maximum), respecting all configured constraints (min/max replicas, behavior policies, etc.), ensuring predictable and safe scaling decisions even when external metrics are unavailable. The community has previously expressed interest in addressing this limitation [#109214](https://github.com/kubernetes/kubernetes/issues/109214). @@ -194,33 +208,15 @@ The community has previously expressed interest in addressing this limitation [# Add optional fallback configuration to the [ExternalMetricSource](https://github.com/kubernetes/kubernetes/blob/48c56e04e0bc2cdc33eb67ee36ca69eba96b5d0b/staging/src/k8s.io/api/autoscaling/v2/types.go#L343) type, allowing users to specify: -1. A failure threshold (number of consecutive failures before activating fallback) +1. A failure duration (how long the metric must be continuously failing before activating fallback) 2. A substitute metric value to use when the threshold is exceeded -This approach: -- **Maintains the HPA algorithm**: Fallback provides a metric value, not a fixed replica count -- **Is per-metric**: Each external metric can have its own fallback configuration -- **Provides visibility**: Status shows which metrics are in fallback state -- **Is conservative**: Only applies to external metrics, which are inherently out-of-cluster - -### User Stories - -#### Story 1: SaaS Application Scaling on Queue Depth - -I run a SaaS application that scales based on a cloud provider's message queue depth (external metric). Occasionally, the cloud provider's metrics API experiences brief outages (5-10 minutes). During these outages, my HPA cannot scale, and customer requests queue up. - -## Proposal - -Add optional fallback configuration to the `ExternalMetricSource` type, allowing users to specify: - -1. A failure threshold (number of consecutive failures before activating fallback) -2. 
A desired replica count to use when the threshold is exceeded - This approach: - **Works with the HPA algorithm**: Fallback provides a desired replica count for that metric, which is combined with other metrics using the standard HPA multi-metric approach (taking the maximum) - **Is per-metric**: Each external metric can have its own fallback configuration - **Provides visibility**: Status shows which metrics are in fallback state - **Is conservative**: Only applies to external metrics, which are inherently out-of-cluster +- **Is consistent**: Duration-based thresholds behave the same across different Kubernetes configurations and reconciliation frequencies ### User Stories @@ -239,7 +235,7 @@ metrics: type: AverageValue averageValue: "30" fallback: - failureThreshold: 3 + failureDuration: 3m # Activate fallback after 3 minutes of consecutive failures replicas: 10 # Scale to 10 replicas to handle presumed backlog ``` @@ -259,7 +255,7 @@ metrics: type: Value value: "0.01" # 1% error rate fallback: - failureThreshold: 3 + failureDuration: 5m # Activate after 5 minutes replicas: 15 # Scale to 15 replicas assuming higher load - type: External external: @@ -269,7 +265,7 @@ metrics: type: Value value: "200" fallback: - failureThreshold: 3 + failureDuration: 3m # Activate after 3 minutes replicas: 12 # Scale to 12 replicas assuming higher load ``` @@ -278,9 +274,16 @@ If only one metric fails, the HPA continues using the healthy metric while treat ### Risks and Mitigations - Risk: Users configure inappropriate fallback replica counts - - Mitigation: Documentation with best practices; validation ensures replicas > 0; HPA min/max constraints still apply -- Risk: Complexity in understanding which metric is in fallback - - Mitigation: Per-metric status clearly shows fallback state and failure count + - Mitigation: Documentation with best practices; validation ensures replicas > 0; HPA min/max constraints still apply; users should consider peak load scenarios when setting fallback values + +- Risk: Users configure failureDuration too short, causing premature fallback activation + - Mitigation: Default value of 3 minutes provides reasonable buffer; validation enforces minimum values; documentation recommends considering normal metric provider latency and transient failures + +- Risk: Users configure failureDuration too long, delaying necessary scaling during outages + - Mitigation: Documentation provides guidance on balancing between avoiding false positives and responding quickly to genuine outages; recommend 3-5 minutes for most use cases + +- Risk: Complexity in understanding which metric is in fallback and why + - Mitigation: Per-metric status clearly shows fallback state, `firstFailureTime` timestamp, and current `fallbackReplicas` value; events are generated when fallback activates with clear messaging including duration and timestamp ## Design Details @@ -289,11 +292,12 @@ Add a new `ExternalMetricFallback` type and include it in `ExternalMetricSource` ```golang // ExternalMetricFallback defines fallback behavior when an external metric cannot be retrieved type ExternalMetricFallback struct { - // failureThreshold is the number of consecutive failures retrieving this metric - // before the fallback value is used. Must be greater than 0. + // failureDuration is the duration for which the external metric must be continuously + // failing before the fallback value is used. The duration is measured from the first + // consecutive failure. Must be greater than 0. 
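	// For example, with the default of "3m", fallback activates once this metric has been
	// failing continuously for three minutes, regardless of the controller's sync period.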
// +optional - // +kubebuilder:default=3 - FailureThreshold *int32 `json:"failureThreshold,omitempty"` + // +kubebuilder:default="3m" + FailureDuration *metav1.Duration `json:"failureDuration,omitempty"` // replicas is the desired replica count to use when the external metric cannot be retrieved. // This value is treated as the desired replica count from this metric. @@ -338,10 +342,10 @@ type ExternalMetricStatus struct { // +optional FallbackActive bool `json:"fallbackActive,omitempty"` - // consecutiveFailureCount tracks the number of consecutive failures retrieving this metric. - // Reset to 0 on successful retrieval. + // firstFailureTime is the timestamp of the first consecutive failure retrieving this metric. + // Reset to nil on successful retrieval. Used to calculate if failureDuration has been exceeded. // +optional - ConsecutiveFailureCount int32 `json:"consecutiveFailureCount,omitempty"` + FirstFailureTime *metav1.Time `json:"firstFailureTime,omitempty"` // fallbackReplicas is the replica count being used while fallback is active. // Only populated when fallbackActive is true. @@ -415,8 +419,9 @@ extending the production code to implement this enhancement. - Verify failureThreshold validation (must be > 0) - Verify replicas validation (must be > 0) - Tests for Failure Tracking and Activation: - - Verify `consecutiveFailureCount` increments on failure and resets on success - - Verify fallback activates when threshold is reached + - Verify `firstFailureTime` is set on first failure and persists through consecutive failures + - Verify `firstFailureTime` is cleared on successful metric retrieval + - Verify fallback activates when current time exceeds `firstFailureTime` + `failureDuration` - Verify fallbackActive status field updates correctly - Tests for Replica Calculation: - Verify fallback returns the configured replica count when threshold is exceeded @@ -465,7 +470,7 @@ We will add the following e2e autoscaling tests: - Success in retrieving external metric resets the failure count and resumes normal scaling - HPA uses max() of healthy metric calculations and fallback replica counts - Fallback respects HPA min/max replica constraints -- Status correctly reflects which metrics are in fallback state +- Status correctly reflects which metrics are in fallback state and shows `firstFailureTime` - With multiple external metrics in fallback, HPA uses the maximum fallback replica count ### Graduation Criteria @@ -560,20 +565,25 @@ When the feature gate is enabled: - Existing HPAs continue to work unchanged - External metrics without `fallback` configuration behave as they do today (no scaling when unavailable) - Users can add `fallback` configuration to external metrics in their HPAs -- The controller begins tracking per-metric `consecutiveFailureCount` for external metrics with fallback configured, starting from 0 -- The `fallbackActive` and `consecutiveFailureCount` status fields are populated for external metrics +- The controller begins tracking per-metric `firstFailureTime` for external metrics with fallback configured + - On the first failure, `firstFailureTime` is set to the current timestamp + - On subsequent failures, the timestamp is preserved to track failure duration + - On success, `firstFailureTime` is cleared (set to nil) +- The `fallbackActive`, `firstFailureTime`, and `fallbackReplicas` status fields are populated for external metrics with fallback configured +- Fallback activates when `(current time - firstFailureTime) >= failureDuration` #### Downgrade When the 
feature gate is disabled: - The `fallback` field in `ExternalMetricSource` is ignored by the controller -- The `fallbackActive` and `consecutiveFailureCount` status fields are not updated (remain at last values but are not used) +- The `fallbackActive`, `firstFailureTime`, and `fallbackReplicas` status fields are not updated (remain at last values but are not used) - All external metrics revert to current behavior: HPA cannot scale based on them when they're unavailable - Any HPAs currently using fallback values will: - Maintain their current replica count - Stop using fallback values - Resume normal metric-based scaling when external metrics become available again - No disruption to running workloads (pods are not restarted) +- The `firstFailureTime` timestamp remains in the status but is not evaluated or updated All logic related to fallback evaluation, failure counting, and status updates is gated by the `HPAExternalMetricFallback` feature gate. @@ -655,7 +665,8 @@ When fallback is configured and activated, the failing metric contributes its co Yes. If the feature gate is disabled: - All `fallback` configurations in HPA specs are ignored by the controller - External metrics revert to current behavior: HPA cannot scale based on them when they're unavailable -- The `fallbackActive` and `consecutiveFailureCount` status fields stop being updated +- The `fallbackActive`, `firstFailureTime`, and `fallbackReplicas` status fields stop being updated + - These fields remain in the HPA status at their last values but are not evaluated or modified - HPAs maintain their current replica count at the time of rollback - No pods are restarted or disrupted @@ -665,12 +676,13 @@ To disable, restart `kube-controller-manager` and `kube-apiserver` with the feat When the feature is re-enabled: - Any HPAs with `fallback` configured on external metrics will resume fallback behavior -- The controller immediately begins tracking `consecutiveFailureCount` for each external metric with fallback, starting from 0 +- The controller clears any stale `firstFailureTime` timestamps and starts fresh - If external metrics are failing at re-enablement: - - Failure counts increment on each reconciliation loop - - Once the configured `failureThreshold` is reached, fallback values are used + - On the first failure, `firstFailureTime` is set to the current timestamp + - The failure duration is calculated as `(current time - firstFailureTime)` + - Once the configured `failureDuration` has elapsed, fallback values are used - The `fallbackActive` status field is set to `true` for affected metrics -- HPAs resume using substitute metric values for scaling decisions when external metrics are unavailable and thresholds are exceeded +- HPAs resume using the static replicas stanza for scaling decisions when external metrics are unavailable and thresholds are exceeded Existing HPAs without `fallback` configuration are not affected by re-enabling the feature and continue with default behavior. @@ -772,11 +784,10 @@ Recall that end users cannot usually observe component logs or access metrics. Users can confirm that the feature is active and functioning by inspecting the status fields exposed by the controller. 
Specifically: - Check the HPA condition to verify if `ExternalMetricFallbackActive` is currently active - Check `.status.currentMetrics[].external.fallbackActive` to verify if fallback is currently active -- Check `.status.currentMetrics[].external.consecutiveFailureCount` to see the current failure count +- Check `.status.currentMetrics[].external.firstFailureTime` to see when failures started Moreover, users can verify the feature is working properly through events on the HPA object: -- When fallback activates: Normal `ExternalMetricFallbackActivated` "Fallback activated for external metric 'queue_depth' after 3 consecutive failures, using fallback replica count: 10" - +- When fallback activates: Normal `ExternalMetricFallbackActivated` "Fallback activated for external metric 'queue_depth' after 3m0s of consecutive failures, using fallback replica count: 10" ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? @@ -809,19 +820,13 @@ This feature doesn't fundamentally change how the HPA controller operates; it ad - `horizontal_pod_autoscaler_controller_metric_computation_duration_seconds` - tracks metric computation time including fallback evaluation - `horizontal_pod_autoscaler_controller_metric_computation_total` - counts metric computations with error status -Additionally, the new feature-specific metrics provide health indicators: -- `horizontal_pod_autoscaler_fallback_active` - indicates which external metrics are currently using fallback values -- `horizontal_pod_autoscaler_external_metric_retrieval_failures_total` - monitors external metric retrieval failures - ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -The following new metrics will be added to the kube-controller-manager to improve observability of the fallback feature: -- `horizontal_pod_autoscaler_fallback_active` (Gauge) - Indicates whether a specific external metric is currently using its fallback value. -- `horizontal_pod_autoscaler_external_metric_retrieval_failures_total` (Count) - Counts the number of consecutive failures when retrieving an external metric (resets to 0 on success). +No. ### Dependencies @@ -904,7 +909,14 @@ Describe them, providing: - Estimated increase in size: (e.g., new annotation of size 32B) - Estimated amount of new objects: (e.g., new Object X for every existing Pod) --> -Yes, HorizontalPodAutoscaler objects will increase in size when fallback is configured: +Yes, `HorizontalPodAutoscaler` objects will increase in size when fallback is configured: +- Spec increase: ~150 bytes per external metric with fallback configured: + - `failureDuration`: ~40 bytes (field name + duration string like "3m") + - `replicas`: ~20 bytes (int32) +- Status increase: ~80 bytes per external metric: + - `fallbackActive`: ~30 bytes (boolean field) + - `firstFailureTime`: ~70 bytes (timestamp field + RFC3339 string like "2024-01-15T10:23:45Z") + - `fallbackReplicas`: ~30 bytes (optional int32 pointer) ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? 
@@ -916,13 +928,12 @@ Think about adding additional work or introducing new steps in between [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos --> -Yes, `HorizontalPodAutoscaler` objects will increase in size when fallback is configured: -- Spec increase: ~150 bytes per external metric with fallback configured: - - failureThreshold: ~30 bytes - - value or averageValue: ~50 bytes (resource.Quantity) -- Status increase: ~80 bytes per external metric: - - fallbackActive: ~30 bytes (boolean field) - - consecutiveFailureCount: ~50 bytes (int32 field) + +No. The feature adds minimal computational overhead to the existing HPA reconciliation loop. The fallback logic is integrated into the existing metric retrieval and evaluation process: +1. Attempt to retrieve external metric (already happens) +2. On failure: check/update `firstFailureTime` (new, minimal overhead) +3. Evaluate if fallback should activate (new, simple comparison) +4. Return either real metric or fallback replica count (already happens for other metric types) ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? @@ -968,12 +979,12 @@ details). For now, we leave it here. If the API server and/or etcd becomes unavailable, the entire HPA controller functionality will be impacted, not just this feature. The HPA controller will not be able to: - Retrieve HPA objects - Get external metrics (or any metrics) -- Update HPA status (including `fallbackActive` and `consecutiveFailureCount` fields) +- Update HPA status (including `fallbackActive`, `firstFailureTime`, and `fallbackReplicas` fields) - Apply scaling decisions Therefore, no autoscaling decisions can be made during this period, regardless of whether fallback is configured. The feature itself doesn't introduce any new failure modes with respect to API server or etcd availability - it's dependent on these components being available just like the rest of the HPA controller's functionality. -Once API server and etcd access is restored, the HPA controller will resume normal operation. The in-memory failure counts will reset, and if external metrics are still failing, the failure count will increment again until reaching the threshold to reactivate fallback. +Once API server and etcd access is restored, the HPA controller will resume normal operation. The in-memory failure counts will reset, if external metrics are still failing and `firstFailureTime` is perseved the controller will use that timestamp to calculate ###### What are other known failure modes? @@ -992,10 +1003,9 @@ For each of them, fill in the following information by copying the below templat ###### What steps should be taken if SLOs are not being met to determine the problem? -Check `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` and `horizontal_pod_autoscaler_fallback_active` to identify if issues correlate with HPAs using fallback. If problems are observed: +Check `horizontal_pod_autoscaler_controller_reconciliation_duration_seconds` to identify if issues correlate with HPAs using fallback. 
If problems are observed: - Check if the issue only affects HPAs with fallback configured -- Verify if multiple external metrics are failing simultaneously (check `horizontal_pod_autoscaler_external_metric_retrieval_failures_total`) - Review HPA events: kubectl describe hpa to see fallback activation events - Check external metrics provider health and connectivity From fe3c01f3452a1129e106acfc5b50ca9cf200379d Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Sat, 8 Nov 2025 19:41:59 +0000 Subject: [PATCH 10/12] TBD metrics Signed-off-by: Omer Aplatony --- keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml b/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml index a213cb09ab8..170efc34a97 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml +++ b/keps/sig-autoscaling/5679-external-metric-fallback/kep.yaml @@ -36,5 +36,4 @@ disable-supported: true # The following PRR answers are required at beta release metrics: - - horizontal_pod_autoscaler_fallback_active - - horizontal_pod_autoscaler_external_metric_retrieval_failures_total + - TBD From b0e8f3e653664ffb72fb975f5985b06facb9775e Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Sun, 9 Nov 2025 16:20:39 +0000 Subject: [PATCH 11/12] update toc Signed-off-by: Omer Aplatony --- .../5679-external-metric-fallback/README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/README.md b/keps/sig-autoscaling/5679-external-metric-fallback/README.md index da7c85e5ee6..8748f7af5a2 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/README.md +++ b/keps/sig-autoscaling/5679-external-metric-fallback/README.md @@ -62,6 +62,9 @@ tags, and then generate with `hack/update-toc.sh`. - [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1: SaaS Application Scaling on Queue Depth](#story-1-saas-application-scaling-on-queue-depth) + - [Story 2: E-commerce Site with Multiple External Metrics](#story-2-e-commerce-site-with-multiple-external-metrics) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [Test Plan](#test-plan) @@ -71,7 +74,11 @@ tags, and then generate with `hack/update-toc.sh`. - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Upgrade](#upgrade) + - [Downgrade](#downgrade) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback) From f39d6e748895931bd531e7eca0f54057cfd5e01d Mon Sep 17 00:00:00 2001 From: Omer Aplatony Date: Thu, 13 Nov 2025 10:25:03 +0000 Subject: [PATCH 12/12] Moved to duraion based configuration Signed-off-by: Omer Aplatony --- .../5679-external-metric-fallback/README.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/keps/sig-autoscaling/5679-external-metric-fallback/README.md b/keps/sig-autoscaling/5679-external-metric-fallback/README.md index 8748f7af5a2..b0efbc4cbf9 100644 --- a/keps/sig-autoscaling/5679-external-metric-fallback/README.md +++ b/keps/sig-autoscaling/5679-external-metric-fallback/README.md @@ -423,7 +423,7 @@ extending the production code to implement this enhancement. 
--> - Tests for Fallback Configuration: - - Verify failureThreshold validation (must be > 0) + - Verify failureDuration validation (must be > 0) - Verify replicas validation (must be > 0) - Tests for Failure Tracking and Activation: - Verify `firstFailureTime` is set on first failure and persists through consecutive failures @@ -728,10 +728,12 @@ feature flags will be enabled on some API servers and not others during the rollout. Similarly, consider large clusters and how enablement/disablement will rollout across nodes. --> -Rollout failures are unlikely to impact running workloads. If enabled during external metrics failures, HPAs with fallback configured might change scaling decisions after the failureThreshold is reached. This is mitigated by: +Rollout failures are unlikely to impact running workloads. If enabled during external metrics failures, HPAs with fallback configured might change scaling decisions after `failureDuration` (default: 3m) has elapsed. This is mitigated by: - The HPA's min/max replica constraints -- The failure threshold requirement (default: 3) before activation +- The `failureDuration` buffer before activation - Gradual HPA scaling behavior +- Scale-up/scale-down stabilization windows + On rollback, HPAs maintain their current replica count and stop using fallback values. No pods are restarted. ###### What specific metrics should inform a rollback?