diff --git a/keps/prod-readiness/sig-scheduling/3990.yaml b/keps/prod-readiness/sig-scheduling/3990.yaml
new file mode 100644
index 00000000000..6dca4302425
--- /dev/null
+++ b/keps/prod-readiness/sig-scheduling/3990.yaml
@@ -0,0 +1,3 @@
+kep-number: 3990
+beta:
+  approver: "@wojtek-t"
diff --git a/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md b/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md
new file mode 100644
index 00000000000..c4112199a26
--- /dev/null
+++ b/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md
@@ -0,0 +1,1121 @@

# KEP-3990: Pod Topology Spread DoNotSchedule to ScheduleAnyway fallback mode

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories (Optional)](#user-stories-optional)
    - [Story 1](#story-1)
    - [Story 2](#story-2)
    - [Story 3](#story-3)
  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [The fallback could be done when it's actually not needed](#the-fallback-could-be-done-when-its-actually-not-needed)
- [Design Details](#design-details)
  - [New API changes](#new-api-changes)
  - [NodeProvisioningFailed](#nodeprovisioningfailed)
    - [[Beta] nodeProvisioningTimeout in the scheduler configuration](#beta-nodeprovisioningtimeout-in-the-scheduler-configuration)
    - [How we implement NodeProvisioningInProgress in the cluster autoscaler](#how-we-implement-nodeprovisioninginprogress-in-the-cluster-autoscaler)
  - [PreemptionFailed](#preemptionfailed)
  - [What if both are specified in FallbackCriteria?](#what-if-both-are-specified-in-fallbackcriteria)
  - [Test Plan](#test-plan)
    - [Prerequisite testing updates](#prerequisite-testing-updates)
    - [Unit tests](#unit-tests)
    - [Integration tests](#integration-tests)
    - [e2e tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Beta](#beta)
    - [GA](#ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
  - [Introduce DoNotScheduleUntilNodeProvisioningFailed and DoNotScheduleUntilPreemptionFailed](#introduce-donotscheduleuntilnodeprovisioningfailed-and-donotscheduleuntilpreemptionfailed)
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
to represent when to fall back from DoNotSchedule to ScheduleAnyway.
It can contain two values: `NodeProvisioningFailed`, to fall back when the cluster autoscaler fails to create a new Node for the Pod,
and `PreemptionFailed`, to fall back when preemption doesn't help make the Pod schedulable.

## Motivation

Pod Topology Spread is designed to enhance high availability by distributing Pods across multiple failure domains.
Ironically, however, it can hurt the availability of Pods
when used with `WhenUnsatisfiable: DoNotSchedule`:
if preemption cannot make a Pod schedulable, or the cluster autoscaler is unable to create a new Node for it,
the Pod stays pending, and the constraint intended to improve availability ends up reducing it.

### Goals

- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`:
  - `NodeProvisioningFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
  - `PreemptionFailed` to fall back when preemption doesn't help make the Pod schedulable.
- A new config field `nodeProvisioningTimeout` is introduced to the PodTopologySpread plugin's configuration.
- Introduce a `NodeProvisioningInProgress` Pod condition:
  - Change the cluster autoscaler to set it to `false` when it cannot create a new Node for the Pod, and to `true` when it succeeds.

### Non-Goals

- Reschedule Pods that were scheduled via the fallback mode to achieve a better distribution after some time.

## Proposal

### User Stories (Optional)

#### Story 1

Your cluster has the cluster autoscaler,
and you widely use Pod Topology Spread with `WhenUnsatisfiable: DoNotSchedule` on the zone topology key to strengthen workloads against zone failure.
If the cluster autoscaler fails to create new Nodes for Pods due to an instance stockout,
you want to fall back from DoNotSchedule to ScheduleAnyway,
because otherwise you would hurt the availability of the workload in pursuit of better availability via Pod Topology Spread.
That would be putting the cart before the horse.

In this case, you can use `NodeProvisioningFailed` in `fallbackCriteria`
to fall back from DoNotSchedule to ScheduleAnyway:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    fallbackCriteria:
      - NodeProvisioningFailed
    labelSelector:
      matchLabels:
        foo: bar
```

#### Story 2

Similar to Story 1, but additionally you want to fall back when the cluster autoscaler doesn't react within a certain time;
maybe the cluster autoscaler is down, or it's taking too long to handle the pending pods.

In this case, in addition to using `NodeProvisioningFailed` in `fallbackCriteria` like Story 1,
the cluster admin can set `nodeProvisioningTimeout` in the scheduler configuration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          # trigger the fallback if a pending pod has been unschedulable for 5 min, but the cluster autoscaler hasn't yet reacted
          nodeProvisioningTimeout: 5m
```

#### Story 3

Your cluster doesn't have the cluster autoscaler
and has some low-priority Pods to make space (often called overprovisioning Pods, balloon Pods, etc.).
You want to leverage preemption to achieve the best distribution as much as possible,
so you have to schedule Pods with `WhenUnsatisfiable: DoNotSchedule`.
But you don't want Pod Topology Spread to leave Pods unschedulable when preemption wouldn't make them schedulable.

In this case, you can use `PreemptionFailed` in `fallbackCriteria`:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    fallbackCriteria:
      - PreemptionFailed
    labelSelector:
      matchLabels:
        foo: bar
```

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

#### The fallback could be done when it's actually not needed

Even if the Pod is rejected by plugins other than Pod Topology Spread,
the scheduler falls back from DoNotSchedule to ScheduleAnyway once one of the specified criteria is satisfied.

One possible mitigation is to add `UnschedulablePlugins`, which is equivalent to [QueuedPodInfo.UnschedulablePlugins](https://github.com/kubernetes/kubernetes/blob/8a7df727820bafed8cef27e094a0212d758fcd40/pkg/scheduler/framework/types.go#L180), somewhere in the Pod status
so that Pod Topology Spread can decide to fall back only when the Pod was rejected by Pod Topology Spread.

## Design Details

### New API changes

```go
// FallbackCriterion represents when the scheduler falls back from the required scheduling constraint to the preferred one.
type FallbackCriterion string

const (
	// NodeProvisioningFailed represents when the Pod has `NodeProvisioningInProgress: false` in its conditions.
	NodeProvisioningFailed FallbackCriterion = "NodeProvisioningFailed"
	// PreemptionFailed represents when the scheduler tried to make space for the Pod by preemption, but failed.
	// Specifically, when the Pod doesn't have `NominatedNodeName` while having `PodScheduled: false`.
	PreemptionFailed FallbackCriterion = "PreemptionFailed"
)

type TopologySpreadConstraint struct {
......
	// FallbackCriteria is a list of criteria that the scheduler uses to decide when to fall back from DoNotSchedule to ScheduleAnyway.
	// It's valid to set this only when WhenUnsatisfiable is DoNotSchedule.
	// If multiple criteria are in this list, the scheduler falls back only when ALL criteria in `FallbackCriteria` are satisfied.
	// It's an optional field. The default value is nil, meaning the scheduler never falls back.
	// +optional
	FallbackCriteria []FallbackCriterion
}

// These are valid conditions of pod.
const (
......
	// NodeProvisioningInProgress indicates that the Pod triggered scaling up the cluster.
	// If it's true, a new Node for the Pod was successfully requested.
	// If it's false, the cluster autoscaler tried to create a new Node for the Pod, but failed.
	NodeProvisioningInProgress PodConditionType = "NodeProvisioningInProgress"
)
```

### NodeProvisioningFailed

`NodeProvisioningFailed` is used to fall back when the Pod fails to trigger scaling up the cluster.
`NodeProvisioningInProgress` is a new Pod condition that shows whether the Pod has triggered scaling up the cluster,
i.e., whether a new Node is being created for the Pod, typically by the cluster autoscaler.

**Fallback scenario**

1. A Pod is rejected and stays unschedulable.
2. The cluster autoscaler finds the unschedulable Pod(s) but cannot create Nodes because of stockouts.
3. The cluster autoscaler adds `NodeProvisioningInProgress: false` to the Pod.
4. The scheduler notices `NodeProvisioningInProgress: false` on the Pod and schedules it, falling back to `ScheduleAnyway` for Pod Topology Spread.

### [Beta] `nodeProvisioningTimeout` in the scheduler configuration

_This is targeting beta._

We'll implement `NodeProvisioningTimeout` to address additional fallback cases,
for example, when the cluster autoscaler is down, or when it takes longer than usual.

```go
type PodTopologySpreadArgs struct {
	// NodeProvisioningTimeout defines how long the scheduler waits for the cluster autoscaler to create Nodes for pending pods rejected by Pod Topology Spread.
	// If the cluster autoscaler hasn't put any value on the `NodeProvisioningInProgress` condition for this period of time,
	// the plugin triggers the fallback for topology spread constraints with `NodeProvisioningFailed` in `FallbackCriteria`.
	// This covers use cases that need the fallback when the cluster autoscaler is down or taking too long to react.
	// Note that we don't guarantee that pods are retried exactly after this timeout period.
	// The scheduler will surely retry them, but there might be some delay, depending on other pending pods, those pods' backoff time, and the scheduling queue's processing timing.
	//
	// This is optional; if it's empty, `NodeProvisioningFailed` in `FallbackCriteria` is only handled when the cluster autoscaler puts `NodeProvisioningInProgress: false`.
	NodeProvisioningTimeout *metav1.Duration
}
```

One difficulty here is how to move pods rejected by the PodTopologySpread plugin to activeQ/backoffQ when the timeout is reached and the fallback should be triggered.
Currently, all requeueing is triggered by cluster events, and there is no capability to requeue a pod based on elapsed time once it's put in the unschedulable pod pool.

We'll need to implement a new special cluster event, `Resource: Time`.
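Before wiring it up, here is a fragment-style sketch (matching the other snippets in this KEP) of what such a time-based queueing hint might look like. This is illustrative only: the exact `QueueingHintFn` signature, the `pl.args` field, and the `lastFailedSchedulingTime` helper are assumptions, not a decided implementation.

```go
// isSchedulableAfterTimePasses is a hypothetical sketch of the PodTopologySpread plugin's
// queueing hint for the `Resource: Time` event. It returns Queue once the pod has been
// unschedulable for longer than NodeProvisioningTimeout without any reaction from the
// cluster autoscaler.
func (pl *PodTopologySpread) isSchedulableAfterTimePasses(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
	if pl.args.NodeProvisioningTimeout == nil {
		// No timeout is configured; only the NodeProvisioningInProgress condition triggers the fallback.
		return framework.QueueSkip, nil
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == "NodeProvisioningInProgress" {
			// The cluster autoscaler has already reacted; the regular Pod update event handles requeueing.
			return framework.QueueSkip, nil
		}
	}
	// lastFailedSchedulingTime is an assumed helper that returns when the pod last failed scheduling.
	if time.Since(lastFailedSchedulingTime(pod)) >= pl.args.NodeProvisioningTimeout.Duration {
		return framework.Queue, nil
	}
	return framework.QueueSkip, nil
}
```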
The PodTopologySpread plugin (or other plugins, if they need it) would register for this event in `EventsToRegister` like this:

```go
// It means pods rejected by this plugin may become schedulable as time passes.
// isSchedulableAfterTimePasses is called periodically with rejected pods.
{Event: fwk.ClusterEvent{Resource: fwk.Time}, QueueingHintFn: pl.isSchedulableAfterTimePasses}
```

In the scheduling queue, we'll have a new function `triggerTimeBasedQueueingHints`, which is triggered periodically, like `flushBackoffQCompleted`.
In `triggerTimeBasedQueueingHints`, Queueing Hints registered for the `Resource: Time` event are invoked for pods rejected by those plugins,
and the scheduling queue requeues or doesn't requeue pods based on the QHints, as usual.

`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**; probably once every 30 seconds is enough.
This is because:
- Triggering `triggerTimeBasedQueueingHints` very often could impact the scheduling throughput because of the queue's lock.
- Even if pods were requeued exactly when `NodeProvisioningTimeout` passed, they might still have to wait for their backoff time to complete
and for other pods in activeQ to be handled.

For this reason, as noted in the `NodeProvisioningTimeout` comment above, we do **not** guarantee that pods are retried exactly after the timeout period.

In summary, the `NodeProvisioningTimeout` config works like this:
1. A Pod with `NodeProvisioningFailed` in `FallbackCriteria` is rejected by the PodTopologySpread plugin.
2. No cluster event arrives that would let the PodTopologySpread plugin requeue the pod.
3. The cluster autoscaler somehow doesn't react to this pod; maybe it's down.
4. The scheduling queue runs `triggerTimeBasedQueueingHints` periodically, which invokes the PodTopologySpread plugin's QHint for the `Resource: Time` event.
5. `NodeProvisioningTimeout` is reached: the PodTopologySpread plugin's QHint for the `Resource: Time` event returns `Queue` after comparing the pod's last scheduling time with `NodeProvisioningTimeout`.
6. The pod is retried, and the PodTopologySpread plugin treats TopologySpreadConstraints with `NodeProvisioningFailed` in `FallbackCriteria` as `ScheduleAnyway` (the fallback is triggered).

#### How we implement `NodeProvisioningInProgress` in the cluster autoscaler

Basically, we just put `NodeProvisioningInProgress: false` on Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).

`status.ScaleUpStatus.PodsRemainUnschedulable` contains the Pods for which the cluster autoscaler [simulates](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L536) the scheduling process and determines that they wouldn't be schedulable in any node group.

As a simple example, if a Pod requests 64 CPU but no node group can satisfy that requirement,
the Pod ends up in `status.ScaleUpStatus.PodsRemainUnschedulable` and gets `NodeProvisioningInProgress: false`.
A more complicated scenario can also be covered this way.
Suppose a Pod requests 64 CPU and only one node group can satisfy that requirement,
but that node group is running out of instances at the moment.
In this case, the first reconciliation selects the node group to make the Pod schedulable,
but the node group size increase request is rejected by the cloud provider because of the stockout.
The node group is then considered unsafe for a while,
and the next reconciliation happens without taking the failed node group into account.
As said, no other node group can satisfy the 64 CPU requirement,
so the Pod finally ends up in `status.ScaleUpStatus.PodsRemainUnschedulable` and gets `NodeProvisioningInProgress: false`.

### PreemptionFailed

`PreemptionFailed` is used to fall back when preemption has failed.
Pod Topology Spread can notice the preemption failure
from `PodScheduled: false` (the past scheduling attempt failed) combined with an empty `NominatedNodeName` (the past PostFilter did nothing for this Pod).

**Fallback scenario**

1. A Pod is rejected in the scheduling cycle.
2. In the PostFilter extension point, the scheduler tries to make space by preemption, but finds that preemption doesn't help.
3. When the Pod is moved back to the scheduling queue, the scheduler adds the `PodScheduled: false` condition to the Pod.
4. The scheduler notices that preemption wasn't performed for the Pod, from `PodScheduled: false` and the empty `NominatedNodeName` on the Pod.
And, it schedules the Pod while falling back to `ScheduleAnyway` for Pod Topology Spread.

### What if both are specified in `FallbackCriteria`?

The scheduler falls back only when all criteria in `FallbackCriteria` are satisfied.

### Test Plan

[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

##### Prerequisite testing updates

##### Unit tests

- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `2023-08-12` - `87%`
- `k8s.io/kubernetes/pkg/api/pod`: `2023-08-12` - `76.6%`
- `k8s.io/kubernetes/pkg/apis/core/validation`: `2023-08-12` - `83.6%`

##### Integration tests

test: https://github.com/kubernetes/kubernetes/blob/6e0cb243d57592c917fe449dde20b0e246bc66be/test/integration/scheduler/filters/filters_test.go#L1066
k8s-triage: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling&test=TestPodTopologySpreadFilter

##### e2e tests

N/A

--

This feature doesn't introduce any new API endpoints and doesn't interact with other components,
so e2e tests don't add extra value over the integration tests.

### Graduation Criteria

#### Alpha

- [ ] The feature gate is added, disabled by default.
- [ ] Add a new field `fallbackCriteria` to `TopologySpreadConstraint`, guarded by the feature gate.
  - [ ] Implement `NodeProvisioningFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
  - [ ] Implement `PreemptionFailed` to fall back when preemption doesn't help make the Pod schedulable.
- [ ] Introduce the `NodeProvisioningInProgress` Pod condition.
- [ ] Implement all tests mentioned in the [Test Plan](#test-plan).

Out of Kubernetes, but:
- [ ] (cluster autoscaler) Set `NodeProvisioningInProgress` after trying to create a Node for the Pod.

#### Beta

- The feature gate is enabled by default.

#### GA

- No negative feedback.
- No bug issues reported.
+ +### Upgrade / Downgrade Strategy + + + +**Upgrade** + +The previous Pod Topology Spread behavior will not be broken. Users can continue to use +their Pod specs as it is. + +To use this enhancement, users need to enable the feature gate (during this feature is in the alpha.), +and add `fallbackCriteria` on their `TopologySpreadConstraint`. + +Also, if users want to use `NodeProvisioningFailed`, they need to use the cluster autoscaler +that supports `NodeProvisioningInProgress` Pod condition. + +**Downgrade** + +kube-apiserver will reject Pod creation with `fallbackCriteria` in `TopologySpreadConstraint`. +Regarding existing Pods, we keep `fallbackCriteria`, but the scheduler ignores them. + +### Version Skew Strategy + + + +N/A + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `PodTopologySpreadFallbackMode` + - Components depending on the feature gate: + - kube-scheduler + - kube-apiserver + +###### Does enabling the feature change any default behavior? + + + +No. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +The feature can be disabled in Alpha and Beta versions +by restarting kube-apiserver and kube-apiserver with the feature-gate off. +In terms of Stable versions, users can choose to opt-out by not setting the +`fallbackCriteria` field. + +###### What happens if we reenable the feature if it was previously rolled back? + +Scheduling of pods with `fallbackCriteria` is affected. + +###### Are there any tests for feature enablement/disablement? + + + +No. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? 
+ + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +- 2023-08-12: Initial KEP PR is submitted. + +## Drawbacks + + + +## Alternatives + + + +### introduce `DoNotScheduleUntilNodeProvisioningFailed` and `DoNotScheduleUntilPreemptionFailed` + +Instead of `FallBackCriteria`, introduce `DoNotScheduleUntilNodeProvisioningFailed` and `DoNotScheduleUntilPreemptionFailed` in `WhenUnsatisfiable`. +`DoNotScheduleUntilNodeProvisioningFailed` corresponds to `NodeProvisioningFailed`, +and `DoNotScheduleUntilPreemptionFailed` corresponds to `PreemptionFailed`. + +We noticed a downside in this way, compared to `FallBackCriteria`. +In other scheduling constraints, we distinguish between preferred and required constraint by where the constraint is written in. +For example, PodAffinity and NodeAffinity, if it's written in `requiredDuringSchedulingIgnoredDuringExecution`, it's required. +And if it's written in `preferredDuringSchedulingIgnoredDuringExecution`, it's preferred. + +In the future, we may want to introduce similar fallback mechanism in such other scheduling constraints, +but, we couldn't make the similar API design if we went with `DoNotScheduleUntilNodeProvisioningFailed` and `DoNotScheduleUntilPreemptionFailed`, +as they don't define preferred or required in enum value like `WhenUnsatisfiable`. + +On the other hand, `FallBackCriteria` allows us to unify APIs in all scheduling constraints. +We will just introduce `FallBackCriteria` field in them and there we go. + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/kep.yaml b/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/kep.yaml new file mode 100644 index 00000000000..4598ea51f51 --- /dev/null +++ b/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/kep.yaml @@ -0,0 +1,35 @@ +title: Pod Topology Spread DoNotSchedule to SchedulingAnyway fallback mode +kep-number: 3990 +authors: + - "@sanposhiho" +owning-sig: sig-scheduling +participating-sigs: + - sig-scheduling + - sig-autoscaling +status: provisional +creation-date: 2023-08-08 +reviewers: + - "@alculquicondor" + - "@MaciekPytel" +approvers: + - "@alculquicondor" + - "@MaciekPytel" + +see-also: + - "/keps/sig-scheduling/895-pod-topology-spread" + +stage: alpha + +latest-milestone: "v1.29" + +milestone: + alpha: "v1.29" + beta: "v1.30" + stable: "v1.32" + +feature-gates: + - name: PodTopologySpreadFallbackMode + components: + - kube-scheduler + - kube-apiserver +disable-supported: true