Commit 2c5cfac

feat: add scaleupTimeout

1 parent 9db7b55

1 file changed: +75 −0 lines
keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md
@@ -215,6 +215,7 @@ know that this has succeeded?

- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]`
  - `ScaleUpFailed` to fall back when the cluster autoscaler fails to create a new Node for the Pod.
  - `PreemptionFailed` to fall back when preemption doesn't help make the Pod schedulable.
- A new config field `scaleupTimeout` is introduced to the PodTopologySpread plugin's configuration.
- Introduce `TriggeredScaleUp` in the Pod conditions.
  - Change the cluster autoscaler to set it to `false` when it cannot create a new Node for the Pod, and to `true` on success.

@@ -273,6 +274,26 @@ topologySpreadConstraints:

#### Story 2

Similar to Story 1, but you additionally want to fall back when the cluster autoscaler doesn't react within a certain time.
In that case, the cluster autoscaler may be down, or it may be taking too long to handle the pods.

In this case, in addition to using `ScaleUpFailed` in `fallbackCriteria` as in Story 1,
the cluster admin can set `scaleupTimeout` in the scheduler configuration.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      # trigger the fallback if a pending pod has been unschedulable for 5 min,
      # but the cluster autoscaler hasn't reacted yet
      scaleupTimeout: 5m
```

#### Story 3

Your cluster doesn't have the cluster autoscaler
and has some low-priority Pods to make space (often called overprovisioning Pods, balloon Pods, etc.).
Basically, you want to leverage preemption to achieve the best possible distribution,

@@ -379,6 +400,60 @@ which creates new Node for Pod typically by the cluster autoscaler.

3. The cluster autoscaler adds `TriggeredScaleUp: false`.
4. The scheduler notices `TriggeredScaleUp: false` on the Pod and schedules it while falling back to `ScheduleAnyway` for Pod Topology Spread.
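
For illustration, the condition the cluster autoscaler sets might look like this in the Pod's status (a sketch based on the design above; the `reason` value is hypothetical and only shown for illustration):

```yaml
status:
  conditions:
  - type: TriggeredScaleUp
    status: "False"        # the cluster autoscaler could not create a new Node for this Pod
    reason: ScaleUpFailed  # hypothetical reason value, for illustration only
```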

### [Beta] `scaleupTimeout` in the scheduler configuration

_This is targeting beta._

We'll implement `ScaleupTimeout` to address additional fallback cases,
for example, when the cluster autoscaler is down or takes longer than usual to react.

```go
type PodTopologySpreadArgs struct {
	// ScaleupTimeout defines how long the scheduler waits for the cluster autoscaler
	// to create nodes for pending pods rejected by Pod Topology Spread.
	// If the cluster autoscaler hasn't put any value on the `TriggeredScaleUp` condition for this period of time,
	// the plugin triggers the fallback for topology spread constraints with `ScaleUpFailed` in `FallbackCriteria`.
	// This covers use cases that need the fallback when the cluster autoscaler is down or takes too long to react.
	// Note that we don't guarantee that pods are retried exactly after this timeout period.
	// The scheduler will surely retry those pods, but there might be some delay,
	// depending on other pending pods, those pods' backoff time, and the scheduling queue's processing timing.
	//
	// This is optional; if it's empty, `ScaleUpFailed` in `FallbackCriteria` is only handled
	// when the cluster autoscaler puts `TriggeredScaleUp: false`.
	ScaleupTimeout *metav1.Duration
}
```
424+
One difficulty here is: how we move pods rejected by the PodTopologySpread plugin to activeQ/backoffQ when the timeout is reached and the fallback should be triggered.
425+
Currently, all the requeueing is triggered by a cluster event and we don't have any capability to trigger it by time since it's put in the unschedulable pod pool.
426+
427+
We'll need to implement a new special cluster event, `Resource: Time`.
The PodTopologySpread plugin (or other plugins, if they need it) would use it in `EventsToRegister` like this:

```go
// It means pods rejected by this plugin may become schedulable as time passes.
// isSchedulableAfterTimePasses is called periodically with rejected pods.
{Event: fwk.ClusterEvent{Resource: fwk.Time}, QueueingHintFn: pl.isSchedulableAfterTimePasses}
```

At the scheduling queue, we'll have a new function `triggerTimeBasedQueueingHints`, which is triggered periodically, like `flushBackoffQCompleted`.
In `triggerTimeBasedQueueingHints`, the Queueing Hints registered for the `Resource: Time` event are invoked for pods rejected by those plugins,
and the scheduling queue requeues (or doesn't requeue) each pod based on the QHints, as usual.

`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**; probably once every 30 seconds is enough.
This is because:
- Triggering `triggerTimeBasedQueueingHints` very often could impact the scheduling throughput because of the queue's lock.
- Even if pods were requeued exactly when `ScaleupTimeout` passed, those pods might still have to wait for their backoff time to complete
  and for other pods in activeQ to be handled.

For this reason, as noted in the `ScaleupTimeout` comment above, we do **not** guarantee that pods are retried exactly after the timeout period.

In summary, the `ScaleupTimeout` config will work like this:
1. A Pod with `ScaleUpFailed` in `FallbackCriteria` is rejected by the PodTopologySpread plugin.
2. There's no cluster event that would make the PodTopologySpread plugin requeue the pod.
3. The cluster autoscaler somehow doesn't react to this pod. Maybe it's down.
4. The scheduling queue runs `triggerTimeBasedQueueingHints` periodically, which invokes the PodTopologySpread plugin's QHint for the `Resource: Time` event.
5. `ScaleupTimeout` is reached: the PodTopologySpread plugin's QHint for the `Resource: Time` event returns `Queue` by comparing the pod's last scheduling time with `ScaleupTimeout`.
6. The pod is retried, and the PodTopologySpread plugin treats TopologySpreadConstraints with `ScaleUpFailed` in `FallbackCriteria` as `ScheduleAnyway` (the fallback is triggered).

#### How we implement `TriggeredScaleUp` in the cluster autoscaler

Basically, we just put `TriggeredScaleUp: false` for Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) on every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).

0 commit comments

Comments
 (0)