- [the fallback could be done when it's actually not needed.](#the-fallback-could-be-done-when-its-actually-not-needed)
- [Design Details](#design-details)
- [new API changes](#new-api-changes)
- [ScaleUpFailed](#scaleupfailed)
- [[Beta] <code>scaleupTimeout</code> in the scheduler configuration](#beta-scaleuptimeout-in-the-scheduler-configuration)
- [How we implement <code>TriggeredScaleUp</code> in the cluster autoscaler](#how-we-implement-triggeredscaleup-in-the-cluster-autoscaler)
- [PreemptionFalied](#preemptionfalied)
- [What if are both specified in <code>FallbackCriterion</code>?](#what-if-are-both-specified-in-fallbackcriterion)
- A new field `fallbackCriteria` is introduced to `PodSpec.TopologySpreadConstraint[*]` (see the sketch below)
  - `ScaleUpFailed` to fallback when the cluster autoscaler fails to create a new Node for the Pod.
  - `PreemptionFailed` to fallback when preemption doesn't help make the Pod schedulable.
- A new config field `scaleupTimeout` is introduced to the PodTopologySpread plugin's configuration.
- introduce `TriggeredScaleUp` in Pod condition
  - change the cluster autoscaler to set it to `false` when it cannot create a new Node for the Pod, and to `true` on success.
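For illustration, the API addition could look roughly like the following. This is only a sketch based on the summary above; the authoritative definition is in the [new API changes](#new-api-changes) section.

```go
type TopologySpreadConstraint struct {
	// ... existing fields ...

	// FallbackCriteria defines when the scheduler falls back from DoNotSchedule
	// to ScheduleAnyway for this constraint.
	// +optional
	FallbackCriteria []FallbackCriterion
}

// FallbackCriterion is a single criterion that triggers the fallback.
type FallbackCriterion string

const (
	// ScaleUpFailed: fall back when the cluster autoscaler fails to create a new Node for the Pod.
	ScaleUpFailed FallbackCriterion = "ScaleUpFailed"
	// PreemptionFailed: fall back when preemption doesn't help make the Pod schedulable.
	PreemptionFailed FallbackCriterion = "PreemptionFailed"
)
```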
#### Story 2

Similar to Story 1, but additionally you want to fallback when the cluster autoscaler doesn't react within a certain time.
In that case, the cluster autoscaler may be down, or it may take too long to handle pods.

In this case, in addition to using `ScaleUpFailed` in `fallbackCriteria` as in Story 1,
the cluster admin can use `scaleupTimeout` in the scheduler configuration.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          # trigger the fallback if a pending pod has been unschedulable for 5 min, but the cluster autoscaler hasn't reacted yet
          scaleupTimeout: 5m
```

#### Story 3
Your cluster doesn't have the cluster autoscaler
and has some low-priority Pods to make space (often called overprovisioning Pods, balloon Pods, etc.).
Basically, you want to leverage preemption to achieve the best distribution as much as possible,
3. The cluster autoscaler adds `TriggeredScaleUp: false`.
4. The scheduler notices `TriggeredScaleUp: false` on the Pod and schedules that Pod while falling back to `ScheduleAnyway` on Pod Topology Spread (see the sketch below).
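A minimal sketch of the check behind step 4, assuming a hypothetical helper and the `TriggeredScaleUp` condition described above; the real plugin code may be structured differently.

```go
package podtopologyspread

import v1 "k8s.io/api/core/v1"

// TriggeredScaleUp is the Pod condition set by the cluster autoscaler (sketch only).
const TriggeredScaleUp v1.PodConditionType = "TriggeredScaleUp"

// scaleUpFailed is a hypothetical helper: it returns true when the cluster autoscaler has
// explicitly reported that it could not create a new Node for this Pod, which is when
// constraints listing ScaleUpFailed in fallbackCriteria are treated as ScheduleAnyway.
func scaleUpFailed(pod *v1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == TriggeredScaleUp {
			return cond.Status == v1.ConditionFalse
		}
	}
	return false
}
```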
### [Beta] `scaleupTimeout` in the scheduler configuration

_This is targeting beta._

We'll implement `ScaleupTimeout` to address additional fallback cases,
for example, when the cluster autoscaler is down, or when the cluster autoscaler takes longer than usual to react.

```go
type PodTopologySpreadArgs struct {
	// ScaleupTimeout defines the time that the scheduler waits for the cluster autoscaler to create nodes for pending pods rejected by Pod Topology Spread.
	// If the cluster autoscaler hasn't put any value on the `TriggeredScaleUp` condition for this period of time,
	// the plugin triggers the fallback for topology spread constraints with `ScaleUpFailed` in `FallbackCriteria`.
	// This is for use cases like needing the fallback when the cluster autoscaler is down or taking too long to react.
	// Note that we don't guarantee that `ScaleupTimeout` means the pods are going to be retried exactly after this timeout period.
	// The scheduler will surely retry those pods, but there might be some delay, depending on other pending pods, those pods' backoff time, and the scheduling queue's processing timing.
	//
	// This is optional; if it's empty, `ScaleUpFailed` in `FallbackCriteria` is only handled when the cluster autoscaler puts `TriggeredScaleUp: false`.
	ScaleupTimeout *metav1.Duration
}
```
One difficulty here is how to move pods rejected by the PodTopologySpread plugin to activeQ/backoffQ when the timeout is reached and the fallback should be triggered.
Currently, all requeueing is triggered by cluster events, and we don't have any capability to requeue a pod based on time once it's put in the unschedulable pod pool.

We'll need to implement a new special cluster event, `Resource: Time`.
The PodTopologySpread plugin (or other plugins, if they need it) would use it in `EventsToRegister` like this:

```go
// It means pods rejected by this plugin may become schedulable as time passes.
// isSchedulableAfterTimePasses is called periodically with rejected pods.
// (The exact ClusterEvent definition for `Resource: Time` is an implementation detail; this is a sketch.)
{Event: framework.ClusterEvent{Resource: framework.Time}, QueueingHintFn: pl.isSchedulableAfterTimePasses},
```
In the scheduling queue, we'll have a new function `triggerTimeBasedQueueingHints`, which is triggered periodically, like `flushBackoffQCompleted`.
In `triggerTimeBasedQueueingHints`, Queueing Hints registered with the `Resource: Time` event are invoked for pods rejected by those plugins,
and the scheduling queue requeues (or doesn't requeue) pods based on the QHints, as usual.
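Below is a rough, self-contained sketch of such a periodic trigger. All types and helpers here (`podInfo`, `timeHints`, `requeue`, and so on) are simplified placeholders, not the actual scheduling queue internals.

```go
package queue

import (
	"sync"
	"time"
)

// Simplified placeholders for the scheduler framework's queueing-hint types.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// TimeHintFn is a queueing hint registered for the special `Resource: Time` event.
type TimeHintFn func(lastSchedulingTime time.Time) QueueingHint

type podInfo struct {
	lastSchedulingTime time.Time
	timeHints          []TimeHintFn // hints of the plugins that rejected this pod
}

type PriorityQueue struct {
	mu                sync.Mutex
	unschedulablePods []*podInfo
	requeue           func(*podInfo) // moves a pod to activeQ/backoffQ via the normal path
}

// triggerTimeBasedQueueingHints is invoked periodically (e.g., every 30s).
// For each pod in the unschedulable pod pool, it runs the time-based queueing hints of the
// plugins that rejected the pod; pods whose hint returns Queue are requeued through the
// usual activeQ/backoffQ path, so backoff is still respected.
func (p *PriorityQueue) triggerTimeBasedQueueingHints() {
	p.mu.Lock()
	defer p.mu.Unlock()

	for _, pi := range p.unschedulablePods {
		for _, hint := range pi.timeHints {
			if hint(pi.lastSchedulingTime) == Queue {
				p.requeue(pi)
				break
			}
		}
	}
}

// Run drives the periodic trigger, mirroring how flushBackoffQCompleted is driven today.
func (p *PriorityQueue) Run(stop <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			p.triggerTimeBasedQueueingHints()
		case <-stop:
			return
		}
	}
}
```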
`triggerTimeBasedQueueingHints` is triggered periodically, **but not very often**. Probably once every 30 seconds is enough.
This is because:
- Triggering `triggerTimeBasedQueueingHints` very often could impact the scheduling throughput because of the queue's lock.
- Even if pods were requeued exactly when `ScaleupTimeout` passed, those pods might still have to wait for their backoff time to complete,
  and for other pods in activeQ to be handled.
For this reason, as you see in the above `ScaleupTimeout` comment, we would **not** guarantee that `ScaleupTimeout` means the pods are going to be retried exactly after the timeout period.
In summary, the `ScaleupTimeout` config will work like this:
1. A Pod with `ScaleUpFailed` in `FallbackCriteria` is rejected by the PodTopologySpread plugin.
2. There's no cluster event that the PodTopologySpread plugin can requeue the pod with.
3. The cluster autoscaler somehow doesn't react to this pod. Maybe it's down.
4. The scheduling queue triggers `triggerTimeBasedQueueingHints` periodically, and `triggerTimeBasedQueueingHints` invokes the PodTopologySpread plugin's QHint for the `Resource: Time` event.
5. `ScaleupTimeout` is reached: the PodTopologySpread plugin's QHint for the `Resource: Time` event returns `Queue` by comparing the pod's last scheduling time with `ScaleupTimeout` (see the sketch below).
6. The pod is retried, and the PodTopologySpread plugin regards the TopologySpreadConstraint with `ScaleUpFailed` in `FallbackCriteria` as `ScheduleAnyway` (the fallback is triggered).
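For step 5, the QHint could look roughly like the following. The signature is simplified and the way the pod's last scheduling time is obtained is an assumption for illustration; the real queueing-hint signature and plumbing are defined by the scheduler framework.

```go
package podtopologyspread

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// QueueingHint is a simplified stand-in for the framework's queueing-hint result.
type QueueingHint int

const (
	QueueSkip QueueingHint = iota
	Queue
)

// PodTopologySpread is trimmed down to the single field this sketch needs;
// scaleupTimeout comes from PodTopologySpreadArgs.ScaleupTimeout.
type PodTopologySpread struct {
	scaleupTimeout time.Duration
}

// isSchedulableAfterTimePasses is the QHint invoked for the `Resource: Time` event.
// It returns Queue once the pod has been pending longer than ScaleupTimeout while the
// cluster autoscaler hasn't put any value on the TriggeredScaleUp condition, so that the
// pod is retried and the fallback to ScheduleAnyway can be applied.
func (pl *PodTopologySpread) isSchedulableAfterTimePasses(pod *v1.Pod, lastSchedulingTime time.Time) QueueingHint {
	// If the cluster autoscaler already reported a result, the TriggeredScaleUp-based
	// handling covers the fallback and this time-based hint doesn't need to requeue the pod.
	for _, cond := range pod.Status.Conditions {
		if cond.Type == "TriggeredScaleUp" {
			return QueueSkip
		}
	}
	if time.Since(lastSchedulingTime) >= pl.scaleupTimeout {
		// ScaleupTimeout is reached without any reaction from the cluster autoscaler.
		return Queue
	}
	return QueueSkip
}
```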
#### How we implement `TriggeredScaleUp` in the cluster autoscaler
Basically, we just put `TriggeredScaleUp: false` for Pods in [status.ScaleUpStatus.PodsRemainUnschedulable](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/processors/status/scale_up_status_processor.go#L37) every [reconciliation (RunOnce)](https://github.com/kubernetes/autoscaler/blob/109998dbf30e6a6ef84fc37ebaccca23d7dee2f3/cluster-autoscaler/core/static_autoscaler.go#L296).
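A rough sketch of the intended behavior on the cluster autoscaler side follows. The types and helper below are placeholders for illustration, not the actual cluster-autoscaler interfaces; the real change would hook into the existing scale-up status processing in `RunOnce`.

```go
package status

// TriggeredScaleUp is the Pod condition the cluster autoscaler writes (sketch only).
const TriggeredScaleUp = "TriggeredScaleUp"

// podRef is a minimal placeholder identifying a pending pod.
type podRef struct {
	Namespace string
	Name      string
}

// conditionPatcher is a hypothetical helper that patches a boolean condition onto a pod's status.
type conditionPatcher interface {
	PatchPodCondition(namespace, name, conditionType string, value bool) error
}

// markTriggeredScaleUp is called once per reconciliation (RunOnce):
// pods in ScaleUpStatus.PodsRemainUnschedulable get TriggeredScaleUp=false,
// and pods a scale-up was triggered for get TriggeredScaleUp=true.
// Error handling is omitted for brevity.
func markTriggeredScaleUp(patcher conditionPatcher, remainUnschedulable, scaleUpTriggered []podRef) {
	for _, p := range remainUnschedulable {
		// The autoscaler could not create a new Node for this pod in this loop.
		_ = patcher.PatchPodCondition(p.Namespace, p.Name, TriggeredScaleUp, false)
	}
	for _, p := range scaleUpTriggered {
		_ = patcher.PatchPodCondition(p.Namespace, p.Name, TriggeredScaleUp, true)
	}
}
```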