Commit 66bccdf

Address review comments
Co-authored-by: kishen-v <kishen.viswanathan@ibm.com>
1 parent e8d8128 commit 66bccdf

2 files changed: +173 −108 lines changed
keps/sig-node/3953-node-resource-hot-plug/README.md

Lines changed: 172 additions & 107 deletions
````diff
@@ -18,8 +18,6 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Goals](#goals)
   - [Non-Goals](#non-goals)
 - [Proposal](#proposal)
-  - [Handling HotUnplug Events](#handling-hotunplug-events)
-  - [Flow Control](#flow-control)
   - [User Stories](#user-stories)
     - [Story 1](#story-1)
     - [Story 2](#story-2)
@@ -29,11 +27,15 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
+  - [Handling hotplug events](#handling-hotplug-events)
+  - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers)
+  - [Handling HotUnplug Events](#handling-hotunplug-events)
+  - [Flow Control](#flow-control)
   - [Test Plan](#test-plan)
     - [Unit tests](#unit-tests)
     - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
-    - [Phase 1: Alpha (target 1.33)](#phase-1-alpha-target-133)
+    - [Phase 1: Alpha (target 1.34)](#phase-1-alpha-target-134)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
     - [Upgrade](#upgrade)
     - [Downgrade](#downgrade)
````
````diff
@@ -147,55 +149,11 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th

 ## Proposal

-This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically.
-The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster.
+This KEP strives to enable node resource hot plugging by making the kubelet watch and retrieve machine resource information from cAdvisor's cache as and when it changes; cAdvisor's cache is already updated periodically.
+The kubelet will fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster.
 Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations.
 With this proposal it is also necessary to recalculate and update the OOMScoreAdj and swap limit for pods that existed before the resize. This carries a small overhead due to the recalculation of swap and OOMScoreAdj.

-### Handling HotUnplug Events
-
-Though this KEP focuses only on resource hotplug, It will enable the kubelet to capture the current available capacity of the node (Regardless of whether it was a hotplug or hotunplug of resources.)
-For now, we will introduce an error mode in the kubelet to inform users about the shrink in the available resources in case of hotunplug.
-
-As the hotunplug events are not completely handled in this KEP, in such cases, it is imperative to move the node to the NotReady state when the current capacity of the node
-is lesser than the initial capacity of the node. This is only to point at the fact that the resources have shrunk on the node and may need attention/intervention.
-
-Once the node has transitioned to the NotReady state, it will be reverted to the ReadyState once when the node's capacity is reconfigured to match or exceed the last valid configuration.
-In this case, valid configuration refers to a state which can either be previous hot-plug capacity or the initial capacity in case there was no history of hotplug.
-
-#### Flow Control
-
-```
-T=0: Node initial Resources:
-  - Memory: 10G
-  - Node state: Ready
-
-T=1: Resize Instance to Hotplug Memory
-  - Current Memory: 10G
-  - Update Memory: 15G
-  - Node state: Ready
-
-T=2: Resize Instance to HotUnplug Memory
-  - Current Memory: 15G
-  - UpdatedMemory: 5G
-  - Node state: NotReady
-
-T=3: Resize Instance to Hotplug Memory
-  - Current Memory: 5G
-  - Updated Memory Size: 15G
-  - Node state: Ready
-```
-
-Few of the concerns surrounding hotunplug are listed below
-* Pod re-admission:
-  * Given that there is probability that the current Pod resource usage may exceed the available capacity of node, its necessary to check if the pod can continue Running
-  or if it has to be terminated due to resource crunch.
-* Recalculate OOM adjust score and Swap limits:
-  * Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed.
-* Handling unplug of reserved CPUs.
-
-we intend to propose a separate KEP dedicated to hotunplug of resources to address the same.
-
 ### User Stories

 #### Story 1
````
````diff
@@ -247,94 +205,197 @@ detect the change in compute capacity, which can bring in additional complicatio
 - Lack of coordination about change in resource availability across kubelet/runtime/plugins.
   - The plugins/runtime should be updated to react to change in resource information on the node.

-- Kubelet missing hotplug event or too many hotplug events
-  - Hotplug events are captured via periodic polling by the kubelet, this ensures that the capacity is updated in the poll cycle and can technically not miss the event/fail to handle a flood of events.
+- Kubelet missing on processing hotplug instance(s)
+  - The kubelet observes the underlying node for any hotplug of resources as and when they are generated;
+  this ensures that the capacity is updated at set intervals and technically cannot miss updating the actual capacity obtained from cAdvisor.

 - Handling downsize events
-  - Though there is no support through this KEP to handle an event of node-downsize, it's the onus of the cluster administrator to resize responsibly to avoid disruption as it lies out of the kubernetes realm.
-  - However, enabling this feature will ensure that the correct resource information is pushed across the cluster.
+  - Though there is no support through this KEP to handle an event of node-downsize, it's the onus of the cluster administrator to resize responsibly to avoid disruption, as it lies outside the kubernetes realm.
+  - However, in the event of a downsize, an error mode is returned by the kubelet and the node is marked as `NotReady`.
+
+- Workloads that are dependent on the initial node configuration, such as:
+  - Workloads that spawn per-CPU processes (threads, workpools, etc.)
+  - Workloads that depend on CPU-memory relationships (e.g. processes that depend on NUMA alignment)
+  - Dependency on external libraries/device drivers to support CPU hotplug.

 ## Design Details

 The diagram below shows the interaction between the kubelet, node, and cAdvisor.

-
 ```mermaid
 sequenceDiagram
   participant node
   participant kubelet
   participant cAdvisor-cache
   participant machine-info
-  kubelet->>cAdvisor-cache: Poll
+  kubelet->>cAdvisor-cache: fetch
   cAdvisor-cache->>machine-info: fetch
   machine-info->>cAdvisor-cache: update
   cAdvisor-cache->>kubelet: update
-  kubelet->>node: node status update
+  alt if increase in resource
+  kubelet->>node: recalculate and update OOMScoreAdj <br> and Swap limit of containers
   kubelet->>node: re-initialize resource managers
-  kubelet->>node: Recalculate and update OOMScoreAdj <br> and Swap limit of pods
+  kubelet->>node: node status update with new capacity
+  else if decrease in resource
+  kubelet->>node: set node status to not ready
+  end
 ```

````
````diff
-The interaction sequence is as follows
-1. Kubelet will be polling in interval to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes.
-2. Kubelet's cache will be updated with the latest machine resource information.
-3. Node status updater will update the node's status with the latest resource information.
-4. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes.
-5. Kubelet will recalculate and update the OOMScoreAdj and swap limit for the existing pods.
-
-With increase in cluster resources the following components will be updated
-
-1. Scheduler
-   * Scheduler will automatically schedule any pending pods.
-
-2. Update in Node allocatable capacity.
-
-3. Resource managers will re-initialised.
-
-4. Change in OOM score adjust
+The interaction sequence is as follows:
+1. Kubelet will fetch machine resource information from cAdvisor's cache; the refresh interval is configurable via the cAdvisor flag `update_machine_info_interval`.
+2. If the machine resource is increased:
+   * Recalculate and update the OOMScoreAdj and swap limit of all running containers.
+   * Re-initialize resource managers.
+   * Update the node with the new resource.
+3. If the machine resource is decreased:
+   * Set node status to not ready. (This will be reverted when the current capacity matches or exceeds either the previous hot-plug capacity, or the initial capacity
+   in case there was no history of hotplug.)
+
+With an increase in cluster resources the following components will be updated:
+1. Change in OOM score adjust:
    * Currently, the OOM score adjust is calculated by
   `1000 - (1000*containerMemReq)/memoryCapacity`
    * Increase in memoryCapacity will result in updated OOM score adjust for pods deployed post resize and also recalculate the same for existing pods.

-5. Change in Swap Memory limit
+2. Change in Swap Memory limit:
    * Currently, the swap memory limit is calculated by
   `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
    * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods.

+3. Resource managers will be re-initialised.
+
+4. Update in Node allocatable capacity.
+
+5. Scheduler:
+   * The scheduler will automatically schedule any pending pods.
+   * This is expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the
+   available capacity of the node and schedules pods accordingly.
````
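The OOM score formula added above is simple integer arithmetic; a minimal sketch below shows why existing containers must be updated after a hotplug. The helper `oomScoreAdj` is a hypothetical stand-in for illustration, not the kubelet's actual implementation.

```go
package main

import "fmt"

// oomScoreAdj mirrors the KEP's formula:
// 1000 - (1000*containerMemReq)/memoryCapacity
func oomScoreAdj(containerMemReqBytes, memoryCapacityBytes int64) int64 {
	return 1000 - (1000*containerMemReqBytes)/memoryCapacityBytes
}

func main() {
	const gi = int64(1) << 30
	req := 2 * gi

	// Before hotplug: 2Gi request on a 10Gi node -> 1000 - 200 = 800.
	fmt.Println(oomScoreAdj(req, 10*gi))
	// After hotplugging memory to 20Gi: 1000 - 100 = 900, so the score
	// of already-running containers is stale until recalculated.
	fmt.Println(oomScoreAdj(req, 20*gi))
}
```

A larger memory capacity makes the same request a smaller fraction of the node, raising the score; this is why the KEP recalculates the value for pre-existing pods.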
````diff
+
+### Handling hotplug events
+
+Once the capacity of the node is altered, the following sequence of events occurs in the kubelet. If any errors are
+observed in any of the steps, the operation is retried from step 1 and a `FailedNodeResize` event is recorded under the node object.
+1. Resizing existing containers:
+   a. With the increased memory capacity of the node, the kubelet proceeds to update fields that are directly related to
+   the available memory on the host. This leads to recalculation of oom_score_adj and swap_limits.
+   b. This is achieved by invoking the CRI API `UpdateContainerResources`.
+
+2. Reinitialise resource managers:
+   a. Resource managers such as the CPU and memory managers are updated with the latest available capacities on the host. This posts the latest
+   available capacities under the node.
+   b. This is achieved by calling ResyncComponents() of the ContainerManager interface to re-sync the resource managers.
+3. Updating the node allocatable resources:
+   a. As the scheduler keeps tabs on the available resources of the node, once the available capacities are updated,
+   the scheduler proceeds to schedule any pending pods.
+
+#### Flow Control for updating swap limit for containers
+
+Formula to calculate Swap Limit: `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
+```
+T=0: Node Resources:
+       - Memory: 6G
+       - Swap: 4G
+     Pod:
+       - container1
+         - MemoryRequest: 2G
+         - State: Running
+     Runtime:
+       - <cgroup_path>/memory.swap.max: 1.33G
+
+T=1: Resize Instance to Hotplug Memory:
+       - Memory: 8G
+       - Swap: 4G
+     Pod:
+       - container1
+         - MemoryRequest: 2G
+         - State: Running
+     Runtime:
+       - <cgroup_path>/memory.swap.max: 1G
+```
+
+Similar flow is applicable for updating oom_score_adj.
````
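The T=0/T=1 transition above can be checked with the swap-limit formula directly; the sketch below reproduces the 1.33G and 1G values from the flow. The `swapLimit` helper is illustrative only and assumes decimal gigabytes as in the example.

```go
package main

import "fmt"

// swapLimit mirrors the KEP's formula:
// (<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>
func swapLimit(containerMemReq, nodeTotalMemory, totalPodsSwapAvailable int64) int64 {
	// Multiply before dividing to avoid integer truncation.
	return containerMemReq * totalPodsSwapAvailable / nodeTotalMemory
}

func main() {
	const g = int64(1_000_000_000) // decimal G, as in the flow above

	// T=0: 2G request, 6G memory, 4G swap -> ~1.33G in memory.swap.max.
	fmt.Println(swapLimit(2*g, 6*g, 4*g)) // 1333333333
	// T=1: memory hotplugged to 8G -> the limit shrinks to 1G.
	fmt.Println(swapLimit(2*g, 8*g, 4*g)) // 1000000000
}
```

Note the direction of the change: hotplugging memory makes the container's request a smaller share of the node, so its swap entitlement decreases, which is why running containers must be resized via `UpdateContainerResources`.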
````diff
+
+### Handling HotUnplug Events
+
+Though this KEP focuses only on resource hotplug, it will enable the kubelet to capture the current available capacity of the node (regardless of whether it was a hotplug or hotunplug of resources).
+For now, we will introduce an error mode in the kubelet to inform users about the shrink in the available resources in case of hotunplug.
+
+As the hot-unplug events are not completely handled in this KEP, in such cases it is imperative to move the node to the NotReady state when the current capacity of the node
+is less than the initial capacity of the node. This is only to point at the fact that the resources have shrunk on the node and may need attention/intervention.
+
+Once the node has transitioned to the NotReady state, it will be reverted to the Ready state when the node's capacity is reconfigured to match or exceed the last valid configuration.
+In this case, valid configuration refers to a state which can either be the previous hot-plug capacity or the initial capacity in case there was no history of hotplug.
+
+#### Flow Control
+
+```
+T=0: Node initial Resources:
+  - Memory: 10G
+  - Node state: Ready
+
+T=1: Resize Instance to Hotplug Memory
+  - Current Memory: 10G
+  - Updated Memory: 15G
+  - Node state: Ready
+
+T=2: Resize Instance to HotUnplug Memory
+  - Current Memory: 15G
+  - Updated Memory: 5G
+  - Node state: NotReady
+
+T=3: Resize Instance to Hotplug Memory
+  - Current Memory: 5G
+  - Updated Memory: 15G
+  - Node state: Ready
+```
+
+A few of the concerns surrounding hotunplug are listed below:
+* Pod re-admission:
+  * Given that there is a probability that the current pod resource usage may exceed the available capacity of the node, it is necessary to check if the pod can continue running
+  or if it has to be terminated due to the resource crunch.
+* Recalculate OOM adjust score and swap limits:
+  * Since the total capacity of the node has changed, values associated with the node's memory capacity must be recomputed.
+* Handling unplug of reserved CPUs.
+
+We intend to propose a separate KEP dedicated to hotunplug of resources to address the same.
````
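The Ready/NotReady transitions in the flow above reduce to a comparison against the last valid configuration. A minimal sketch follows, assuming hypothetical helpers (`lastValidCapacity`, `nodeReady`) that are not part of the kubelet; they only illustrate the rule stated in this section.

```go
package main

import "fmt"

// lastValidCapacity returns the capacity the node must regain to become
// Ready again: the most recent hot-plug capacity if there is one,
// otherwise the initial capacity.
func lastValidCapacity(initial int64, hotplugHistory []int64) int64 {
	if len(hotplugHistory) == 0 {
		return initial
	}
	return hotplugHistory[len(hotplugHistory)-1]
}

// nodeReady reports Ready when current capacity matches or exceeds
// the last valid configuration.
func nodeReady(current, lastValid int64) bool {
	return current >= lastValid
}

func main() {
	const g = int64(1) << 30
	initial := 10 * g
	history := []int64{15 * g} // T=1 hotplugged 10G -> 15G

	// T=1: 15G >= 15G -> Ready.
	fmt.Println(nodeReady(15*g, lastValidCapacity(initial, history)))
	// T=2: hot-unplug to 5G, below the last valid capacity -> NotReady.
	fmt.Println(nodeReady(5*g, lastValidCapacity(initial, history)))
	// T=3: hotplug back to 15G -> Ready again.
	fmt.Println(nodeReady(15*g, lastValidCapacity(initial, history)))
}
```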
364+
303365
**Proposed Code changes**
304366

305367
**Pseudocode for Resource Hotplug**
306368

307369
```go
308-
if utilfeature.DefaultFeatureGate.Enabled(features.NodeResourceHotPlug) {
309-
// Handle the node dynamic scale up
310-
machineInfo, err := kl.cadvisor.MachineInfo()
311-
if err != nil {
312-
klog.ErrorS(err, "Error fetching machine info")
313-
} else {
314-
cachedMachineInfo, _ := kl.GetCachedMachineInfo()
315-
// Avoid collector collects it as a timestamped metric
316-
// See PR #95210 and #97006 for more details.
317-
machineInfo.Timestamp = time.Time{}
318-
if !reflect.DeepEqual(cachedMachineInfo, machineInfo) {
319-
kl.setCachedMachineInfo(machineInfo)
320-
321-
// Resync the resource managers
322-
if err := kl.containerManager.ResyncComponents(machineInfo); err != nil {
323-
klog.ErrorS(err, "Error resyncing the kubelet components with machine info")
324-
}
325-
326-
// Recalculate OOMScoreAdj and Swap Limit.
327-
// NOTE: we will make use UpdateContainerResources CRI method to update the values.
328-
if err := kl.RecalculateOOMScoreAdjAndSwap(); err != nil {
329-
klog.ErrorS(err, "Error recalculating OOMScoreAdj and Swap")
330-
}
331-
332-
}
333-
}
334-
}
370+
func (kl *Kubelet) syncLoopIteration(ctx context.Context, configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
371+
syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
372+
.
373+
.
374+
case machineInfo := <-kl.nodeResourceManager.MachineInfo():
375+
// Resize the containers.
376+
klog.InfoS("Resizing containers due to change in MachineInfo")
377+
if err := resizeContainers(); err != nil {
378+
klog.ErrorS(err, "Failed to resize containers with change in machine info")
379+
kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.FailedNodeResize, err.Error())
380+
break
381+
}
382+
383+
// Resync the resource managers.
384+
klog.InfoS("ResyncComponents resource managers because of change in MachineInfo")
385+
if err := kl.containerManager.ResyncComponents(machineInfo); err != nil {
386+
klog.ErrorS(err, "Failed to resync resource managers with machine info update")
387+
kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.FailedNodeResize, err.Error())
388+
break
389+
}
390+
391+
// Update the cached MachineInfo.
392+
kl.setCachedMachineInfo(machineInfo)
393+
.
394+
.
395+
}
335396
```
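The channel-based shape of the pseudocode above can be sketched in isolation. The types below (`MachineInfo`, `nodeResourceManager`) are simplified stand-ins invented for this example, not real kubelet or cAdvisor types; the point is only the pattern of a manager publishing machine-info updates over a channel that a sync loop selects on.

```go
package main

import "fmt"

// MachineInfo is a simplified stand-in for cAdvisor's machine info.
type MachineInfo struct {
	MemoryCapacityBytes int64
}

// nodeResourceManager is a hypothetical component that publishes a new
// MachineInfo whenever the observed machine configuration changes,
// mirroring the kl.nodeResourceManager.MachineInfo() case above.
type nodeResourceManager struct {
	updates chan MachineInfo
}

func (m *nodeResourceManager) MachineInfo() <-chan MachineInfo { return m.updates }

func main() {
	mgr := &nodeResourceManager{updates: make(chan MachineInfo, 1)}

	// Simulate a memory hotplug event being observed.
	mgr.updates <- MachineInfo{MemoryCapacityBytes: 15 << 30}
	close(mgr.updates)

	// A minimal sync loop: react to each update, as syncLoopIteration does.
	for info := range mgr.MachineInfo() {
		fmt.Printf("resync managers for new capacity: %d GiB\n", info.MemoryCapacityBytes>>30)
	}
}
```

A buffered channel keeps the producer from blocking if the loop is busy; the real kubelet sync loop multiplexes this case alongside pod updates and housekeeping.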
````diff

-**Changes to resource managers to adapt to dynamic scale up of resources**
+**Changes to resource managers to adapt to hotplug of resources**

 1. Adding ResyncComponents() method to ContainerManager interface
 ```go
````
````diff
@@ -369,7 +430,7 @@ to implement this enhancement.
 2. Add necessary tests in kubelet_pods_test.go to check for the pod cleanup and pod addition workflow.
 3. Add necessary tests in eventhandlers_test.go to check for scheduler behaviour with dynamic node capacity change.
 4. Add necessary tests in resource managers to check for managers behaviour to adapt to dynamic node capacity change.
-
+5. Add necessary tests to validate change in oom_score and swap limit for containers post resize.

 ##### e2e tests
````
````diff
@@ -385,8 +446,7 @@ Following scenarios need to be covered:
 ### Graduation Criteria

-#### Phase 1: Alpha (target 1.33)
-
+#### Phase 1: Alpha (target 1.34)

 * Feature is disabled by default. It is an opt-in feature which can be enabled by enabling the `NodeResourceHotPlug`
 feature gate.
````
````diff
@@ -835,7 +895,8 @@ nodes capacity may be lower than existing workloads memory requirement.

 ## Alternatives

-Horizontally scale the cluster by incorporating additional compute nodes.
+* Horizontally scale the cluster by incorporating additional compute nodes.
+* Use fake placeholder resources that are available but not enabled (e.g., balloon drivers)

 <!--
 What other approaches did you consider, and why did you rule them out? These do
@@ -860,3 +921,7 @@ VMs of cluster should support hot plug of compute resources for e2e tests.
 * At present, the machine data is retrieved from cAdvisor's cache through periodic checks. There is ongoing development to utilize CRI APIs for this purpose.
 * Presently, resource managers are updated through regular polling. Once the CRI APIs are enhanced to fetch machine information, we can significantly enhance the reinitialization of resource managers,
 enabling them to respond more effectively to resize events.
+
+* Knobs to alter Kube and System reserved
+  * Currently, these values are calculated and set by individual cloud providers or vendors.
+  * This can be further explored to enable options to set the kube and system reserved capacities as tunables.
````

keps/sig-node/3953-node-resource-hot-plug/kep.yaml

Lines changed: 1 addition & 1 deletion
````diff
@@ -29,7 +29,7 @@ stage: "alpha"
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.33"
+latest-milestone: "v1.34"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
````
