Commit 5337338

feat: add scaleupTimeout

1 parent 9db7b55 commit 5337338

File tree

1 file changed: +77 −0 lines changed
  • keps/sig-scheduling/3990-pod-topology-spread-fallback-mode
keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md

Lines changed: 77 additions & 0 deletions
@@ -86,12 +86,14 @@ tags, and then generate with `hack/update-toc.sh`.
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1](#story-1)
     - [Story 2](#story-2)
+    - [Story 3](#story-3)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
     - [the fallback could be done when it's actually not needed.](#the-fallback-could-be-done-when-its-actually-not-needed)
 - [Design Details](#design-details)
   - [new API changes](#new-api-changes)
     - [ScaleUpFailed](#scaleupfailed)
+    - [[Beta] <code>scaleupTimeout</code> in the scheduler configuration](#beta-scaleuptimeout-in-the-scheduler-configuration)
     - [How we implement <code>TriggeredScaleUp</code> in the cluster autoscaler](#how-we-implement-triggeredscaleup-in-the-cluster-autoscaler)
     - [PreemptionFalied](#preemptionfalied)
     - [What if are both specified in <code>FallbackCriterion</code>?](#what-if-are-both-specified-in-fallbackcriterion)
@@ -215,6 +217,7 @@ know that this has succeeded?
 - A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
   - `ScaleUpFailed` to fallback when the cluster autoscaler fails to create new Node for Pod.
   - `PreemptionFailed` to fallback when preemption doesn't help make Pod schedulable.
+- A new config field `scaleupTimeout` is introduced to the PodTopologySpread plugin's configuration.
 - introduce `TriggeredScaleUp` in Pod condition
   - change the cluster autoscaler to set it `false` when it cannot create new Node for the Pod, `true` when success.
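For illustration, a constraint using the new field might look like the following. The `fallbackCriteria` name and its `ScaleUpFailed` value follow this KEP; the surrounding fields are a standard topology spread constraint:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web
    # Fall back to ScheduleAnyway once the cluster autoscaler
    # reports that it cannot create a new Node for this Pod.
    fallbackCriteria:
      - ScaleUpFailed
```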

@@ -273,6 +276,26 @@ topologySpreadConstraints:
 
 #### Story 2
 
+Similar to Story 1, but additionally you want to fall back when the cluster autoscaler doesn't react within a certain time.
+In that case, the cluster autoscaler may be down, or it may be taking too long to handle pods.
+
+In this case, in addition to using `ScaleUpFailed` in `fallbackCriteria` as in Story 1,
+the cluster admin can set `scaleupTimeout` in the scheduler configuration.
+
+```yaml
+apiVersion: kubescheduler.config.k8s.io/v1
+kind: KubeSchedulerConfiguration
+profiles:
+  - schedulerName: default-scheduler
+    pluginConfig:
+      - name: PodTopologySpread
+        args:
+          # Trigger the fallback if a pending pod has been unschedulable for 5 min, but the cluster autoscaler hasn't yet reacted.
+          scaleupTimeout: 5m
+```
+
+#### Story 3
+
 Your cluster doesn't have the cluster autoscaler
 and has some low-priority Pods to make space (often called overprovisional Pods, balloon Pods, etc.).
 Basically, you want to leverage preemption to achieve the best distribution as much as possible,
@@ -379,6 +402,60 @@ which creates new Node for Pod typically by the cluster autoscaler.
 3. The cluster autoscaler adds `TriggeredScaleUp: false`.
 4. The scheduler notices `TriggeredScaleUp: false` on Pod and schedules that Pod while falling back to `ScheduleAnyway` on Pod Topology Spread.

+### [Beta] `scaleupTimeout` in the scheduler configuration
+
+_This is targeting beta._
+
+We'll implement `ScaleupTimeout` to address additional fallback cases,
+for example, when the cluster autoscaler is down, or when it takes longer than usual to react.
+
+```go
+type PodTopologySpreadArgs struct {
+	// ScaleupTimeout defines how long the scheduler waits for the cluster autoscaler to create nodes for pending pods rejected by Pod Topology Spread.
+	// If the cluster autoscaler hasn't put any value on the `TriggeredScaleUp` condition within this period,
+	// the plugin triggers the fallback for topology spread constraints with `ScaleUpFailed` in `FallbackCriteria`.
+	// This covers use cases such as needing the fallback when the cluster autoscaler is down or takes too long to react.
+	// Note that `ScaleupTimeout` does not guarantee that pods are retried exactly after this timeout period.
+	// The scheduler will retry those pods, but there might be some delay, depending on other pending pods, those pods' backoff time, and the scheduling queue's processing timing.
+	//
+	// This is optional; if it's empty, `ScaleUpFailed` in `FallbackCriteria` is only handled when the cluster autoscaler puts `TriggeredScaleUp: false`.
+	ScaleupTimeout *metav1.Duration
+}
+```
+
+One difficulty here is how to move pods rejected by the PodTopologySpread plugin to activeQ/backoffQ when the timeout is reached and the fallback should be triggered.
+Currently, all requeueing is triggered by cluster events; once a pod is put in the unschedulable pod pool, there is no capability to requeue it based on elapsed time.
+
+We'll need to implement a new special cluster event, `Resource: Time`.
+The PodTopologySpread plugin (or other plugins, if they need it) would use it in `EventsToRegister` like this:
+
+```go
+// It means pods rejected by this plugin may become schedulable as time passes.
+// isSchedulableAfterTimePasses is called periodically with rejected pods.
+{Event: fwk.ClusterEvent{Resource: fwk.Time}, QueueingHintFn: pl.isSchedulableAfterTimePasses}
+```
+
+At the scheduling queue, we'll have a new function, `triggerTimeBasedQueueingHints`, which is triggered periodically, like `flushBackoffQCompleted`.
+In `triggerTimeBasedQueueingHints`, QueueingHints registered for the `Resource: Time` event are invoked for pods rejected by those plugins,
+and the scheduling queue requeues or doesn't requeue each pod based on the QHints, as usual.
+
+`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**; once every 30 seconds is probably enough.
+This is because:
+- Triggering `triggerTimeBasedQueueingHints` very often could impact scheduling throughput because of the queue's lock.
+- Even if pods were requeued exactly when `ScaleupTimeout` passed, they might still have to wait for their backoff time to complete
+and for other pods in activeQ to be handled.
+
+For this reason, as noted in the `ScaleupTimeout` comment above, we do **not** guarantee that pods are retried exactly after the timeout period.
+
+In summary, the `ScaleupTimeout` config will work like this:
+1. A Pod with `ScaleUpFailed` in `FallbackCriteria` is rejected by the PodTopologySpread plugin.
+2. No cluster event arrives that would let the PodTopologySpread plugin requeue the pod.
+3. The cluster autoscaler somehow doesn't react to this pod; maybe it's down.
+4. The scheduling queue triggers `triggerTimeBasedQueueingHints` periodically, which invokes the PodTopologySpread plugin's QHint for the `Resource: Time` event.
+5. Once `ScaleupTimeout` is reached, that QHint returns `Queue` by comparing the pod's last scheduling time with `ScaleupTimeout`.
+6. The pod is retried, and the PodTopologySpread plugin regards TopologySpreadConstraints with `ScaleUpFailed` in `FallbackCriteria` as `ScheduleAnyway` (the fallback is triggered).
+
#### How we implement `TriggeredScaleUp` in the cluster autoscaler
Basically, we just put `TriggeredScaleUp: false` for Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).
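As an illustration of that per-reconciliation step, a condition-setting helper might look like this. The types are stubs for the sketch; the real autoscaler would patch `v1.Pod` conditions through the API server:

```go
package main

import "fmt"

// podCondition and pod are illustrative stand-ins for v1.PodCondition / v1.Pod.
type podCondition struct {
	Type   string
	Status string
}

type pod struct {
	name       string
	conditions []podCondition
}

// setTriggeredScaleUp sketches what each RunOnce would do: set
// TriggeredScaleUp=False on pods in PodsRemainUnschedulable, and
// True on pods a scale-up was triggered for.
func setTriggeredScaleUp(p *pod, triggered bool) {
	status := "False"
	if triggered {
		status = "True"
	}
	// Update an existing condition instead of appending a duplicate.
	for i := range p.conditions {
		if p.conditions[i].Type == "TriggeredScaleUp" {
			p.conditions[i].Status = status
			return
		}
	}
	p.conditions = append(p.conditions, podCondition{Type: "TriggeredScaleUp", Status: status})
}

func main() {
	p := &pod{name: "web-0"}
	setTriggeredScaleUp(p, false) // pod was in PodsRemainUnschedulable
	fmt.Println(p.conditions[0].Type, p.conditions[0].Status) // TriggeredScaleUp False
}
```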
