@@ -138,7 +138,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th

 * Achieve seamless node capacity expansion through resource hotplug.
 * Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation.
-* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods.
+* Recalculating and updating the swap memory limit for existing pods.

 ### Non-Goals
@@ -187,19 +187,24 @@ detect the change in compute capacity, which can bring in additional complicatio

 ### Risks and Mitigations

-- Change in OOMScoreAdjust value:
-  - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity`
-  - With change in memoryCapacity post up-scale, the existing OOMScoreAdjust may not be in line with the
-    actual OOMScoreAdjust for existing pods.
-  - This can be mitigated by recalculating the OOMScoreAdjust value for the existing pods. However, there can be an associated overhead for
-    recalculating the scores.
 - Change in Swap limit:
   - The formula to calculate the swap limit is `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
   - With change in nodeTotalMemory and totalPodsSwapAvailable post up-scale, the existing swap limit may not be in line with the
     actual swap limit for existing pods.
   - This can be mitigated by recalculating the swap limit for the existing pods. However, there can be an associated overhead for
     recalculating the limits.
+- Change in OOMScoreAdjust value:
+  - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity`
+  - With change in memoryCapacity post up-scale, the existing OOMScoreAdjust may not be in line with the
+    actual OOMScoreAdjust for existing pods.
+  - It is not recommended to update the OOMScoreAdjust of a running container, as the OOMScoreAdjust value is set for the init process (pid 1), which is
+    responsible for running all of the container's other processes.
+  - When OOMScoreAdjust is updated for a running container, it is set only for the container's init process and processes started afterwards;
+    already-running processes won't get the new OOMScoreAdjust value.
+  - This can be mitigated by updating the OOMScoreAdj formula to not consider the current memory value.
 - Post up-scale, any failure in resync of resource managers may lead to incorrect or rejected allocation, which can lead to underperforming or rejected workloads.
 - To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur.
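The two formulas discussed in this hunk can be sketched in Go. The function names, MiB units, and example values below are illustrative assumptions for the sketch, not the kubelet's actual implementation.

```go
package main

import "fmt"

// oomScoreAdj sketches `1000 - (1000*containerMemReq)/memoryCapacity`
// from the risk above; arguments are in MiB. Hypothetical helper.
func oomScoreAdj(containerMemReqMiB, memoryCapacityMiB int64) int64 {
	return 1000 - (1000*containerMemReqMiB)/memoryCapacityMiB
}

// swapLimit sketches
// `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`;
// multiplying before dividing avoids integer truncation.
func swapLimit(containerMemReqMiB, nodeTotalMemoryMiB, totalPodsSwapMiB int64) int64 {
	return containerMemReqMiB * totalPodsSwapMiB / nodeTotalMemoryMiB
}

func main() {
	// Before hotplug: 8 GiB node, 2 GiB request, 4 GiB swap for pods.
	fmt.Println(oomScoreAdj(2048, 8192))     // 750
	fmt.Println(swapLimit(2048, 8192, 4096)) // 1024 MiB

	// After hotplugging to 16 GiB, the same request yields different
	// values: this is the drift described in this hunk.
	fmt.Println(oomScoreAdj(2048, 16384))     // 875
	fmt.Println(swapLimit(2048, 16384, 4096)) // 512 MiB
}
```

The worked values show why existing pods end up out of line with the formulas after an up-scale: both results depend on total node capacity, which the hotplug changes underneath them.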
@@ -235,7 +240,7 @@ sequenceDiagram
     machine-info->>cAdvisor-cache: update
     cAdvisor-cache->>kubelet: update
     alt if increase in resource
-    kubelet->>node: recalculate and update OOMScoreAdj <br> and Swap limit of containers
+    kubelet->>node: recalculate and update Swap limit of containers
     kubelet->>node: re-initialize resource managers
     kubelet->>node: node status update with new capacity
     else if decrease in resource
@@ -246,29 +251,24 @@ sequenceDiagram
 The interaction sequence is as follows:

 1. Kubelet will fetch machine resource information from cAdvisor's cache, which is configurable via the cAdvisor flag `update_machine_info_interval`.
 2. If the machine resource is increased:
-   * Recalculate and update the OOMScoreAdj and Swap limit of all the running containers.
+   * Recalculate and update the Swap limit of all the running containers.
    * Re-initialize resource managers.
    * Update node with new resource.
 3. If the machine resource is decreased:
    * Set node status to not ready. (This will be reverted when the current capacity exceeds or matches either the previous hot-plug capacity or the initial capacity
      in case there was no history of hotplug.)

 With increase in cluster resources the following components will be updated:
-1. Change in OOM score adjust:
-   * Currently, the OOM score adjust is calculated by
-     `1000 - (1000*containerMemReq)/memoryCapacity`
-   * Increase in memoryCapacity will result in updated OOM score adjust for pods deployed post resize and also recalculate the same for existing pods.
-2. Change in Swap Memory limit:
+1. Change in Swap Memory limit:
    * Currently, the swap memory limit is calculated by
      `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
    * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods.
-3. Resource managers are re-initialised.
+2. Resource managers are re-initialised.
-4. Update in Node capacity.
+3. Update in Node capacity.
-5. Scheduler:
+4. Scheduler:
    * Scheduler will automatically schedule any pending pods.
    * This is done as expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the
      available capacity of the node and creates pods accordingly.
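The branching in the interaction sequence above can be sketched as a single decision in the kubelet's sync path. All names below (`handleCapacityChange`, the step strings) are illustrative placeholders, not kubelet APIs.

```go
package main

import "fmt"

// handleCapacityChange sketches the interaction sequence above:
// on an increase, recalculate swap limits, resync managers, and
// publish the new capacity; on a decrease, mark the node NotReady.
func handleCapacityChange(oldCap, newCap int64) []string {
	var steps []string
	switch {
	case newCap > oldCap:
		steps = append(steps,
			"recalculate swap limit of running containers",
			"re-initialize resource managers",
			"update node status with new capacity")
	case newCap < oldCap:
		steps = append(steps, "set node status to NotReady")
	}
	return steps
}

func main() {
	fmt.Println(handleCapacityChange(8, 16)) // increase path: 3 steps
	fmt.Println(handleCapacityChange(16, 8)) // decrease path: 1 step
}
```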
@@ -287,6 +287,7 @@ observed in any of the steps, operation is retried from step 1 along with a `Fai
    a. Resource managers such as CPU, Memory are updated with the latest available capacities on the host. This posts the latest
       available capacities under the node.
    b. This is achieved by calling ResyncComponents() of the ContainerManager interface to re-sync the resource managers.
+
 3. Updating the node allocatable resources:
    a. As the scheduler keeps a tab on the available resources of the node, post updating the available capacities,
       the scheduler proceeds to schedule any pending pods.
@@ -317,9 +318,7 @@ T=1: Resize Instance to Hotplug Memory:
 - <cgroup_path>/memory.swap.max: 1G
 ```

-Similar flow is applicable for updating oom_score_adj.
-
-#### Compatability with Cluster Autoscaler
+#### Compatibility with Cluster Autoscaler

 The Cluster Autoscaler (CA) presently anticipates uniform allocatable values among nodes within the same NodeGroup, using existing Nodes as templates for
 newly provisioned Nodes from the same NodeGroup. However, with the introduction of NodeResourceHotplug, this assumption may no longer hold true.