Commit 29835a8 (parent 579af1b)

Update OOMScoreAdj formula

File tree: keps/sig-node/3953-node-resource-hot-plug

1 file changed: +24 / -25 lines
keps/sig-node/3953-node-resource-hot-plug/README.md

Lines changed: 24 additions & 25 deletions
@@ -29,7 +29,7 @@ tags, and then generate with `hack/update-toc.sh`.
   - [Design Details](#design-details)
     - [Handling hotplug events](#handling-hotplug-events)
     - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers)
-    - [Compatability with Cluster Autoscaler](#compatability-with-cluster-autoscaler)
+    - [Compatibility with Cluster Autoscaler](#compatibility-with-cluster-autoscaler)
     - [Handling HotUnplug Events](#handling-hotunplug-events)
     - [Flow Control](#flow-control)
   - [Test Plan](#test-plan)
@@ -138,7 +138,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th
 
 * Achieve seamless node capacity expansion through resource hotplug.
 * Enable the re-initialization of resource managers (CPU manager, memory manager) and the kube runtime manager without a reset, to accommodate alterations in the node's resource allocation.
-* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods.
+* Recalculating and updating the swap memory limit for existing pods.
 
 ### Non-Goals
 
@@ -187,19 +187,24 @@ detect the change in compute capacity, which can bring in additional complicatio
 
 ### Risks and Mitigations
 
-- Change in OOMScoreAdjust value:
-  - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity`
-  - With change in memoryCapacity post up-scale, the existing OOMScoreAdjust may not be in line with the
-    actual OOMScoreAdjust for existing pods.
-  - This can be mitigated by recalculating the OOMScoreAdjust value for the existing pods. However, there can be an associated overhead for
-    recalculating the scores.
 - Change in Swap limit:
   - The formula to calculate the swap limit is `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
   - With change in nodeTotalMemory and totalPodsSwapAvailable post up-scale, the existing swap limit may not be in line with the
     actual swap limit for existing pods.
   - This can be mitigated by recalculating the swap limit for the existing pods. However, there can be an associated overhead for
     recalculating the limits.
+- Change in OOMScoreAdjust value:
+  - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity`
+  - With change in memoryCapacity post up-scale, the existing OOMScoreAdjust may not be in line with the
+    actual OOMScoreAdjust for existing pods.
+  - It is not recommended to update the OOMScoreAdjust of a running container, as the OOMScoreAdjust value is set on the init process (PID 1), which is
+    responsible for running all of the container's other processes.
+  - When we update OOMScoreAdjust for a running container, it is set only for the container init process (and possibly for processes started later);
+    already running processes will not get the new OOMScoreAdjust value.
+  - This can be mitigated by updating the OOMScoreAdj formula to not consider the current memory value, so the new OOMScoreAdj formula looks like this:
+    `min(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)`
 
 - Post up-scale, any failure in resync of resource managers may lead to incorrect or rejected allocations, which can result in underperforming or rejected workloads.
   - To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur.

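As an editorial aside, the drift described in the OOMScoreAdjust risk above, and the effect of the new formula, can be checked with plain arithmetic. A minimal Python sketch computing both formulas exactly as written in the hunk; the 2Gi request and 8Gi-to-16Gi capacities are assumed example sizes, not values from the KEP:

```python
# Sketch: compare the old and new OOMScoreAdjust formulas quoted above.
# Sizes (2 GiB request, 8 GiB -> 16 GiB capacity) are illustrative assumptions.

GI = 1024 ** 3

def oom_score_adj_old(container_mem_req, memory_capacity):
    # Old formula: 1000 - (1000*containerMemReq)/memoryCapacity
    # (integer arithmetic, as the kubelet computes it on integer byte counts)
    return 1000 - (1000 * container_mem_req) // memory_capacity

def oom_score_adj_new(container_mem_req, initial_memory_capacity):
    # New formula: min(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)
    # initialMemoryCapacity is fixed at startup, so a later hotplug
    # no longer changes the score.
    return min(0, 1000 - (1000 * container_mem_req) // initial_memory_capacity)

req = 2 * GI

# Old formula drifts when capacity is hot-plugged from 8Gi to 16Gi:
print(oom_score_adj_old(req, 8 * GI), oom_score_adj_old(req, 16 * GI))  # 750 875

# New formula depends only on the initial capacity, so it is stable:
print(oom_score_adj_new(req, 8 * GI))  # 0
```

Note that, as written in the diff, the `min(0, ...)` clamp maps any positive score to 0; only a request exceeding the initial capacity would yield a negative value.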
@@ -235,7 +240,7 @@ sequenceDiagram
     machine-info->>cAdvisor-cache: update
     cAdvisor-cache->>kubelet: update
     alt if increase in resource
-        kubelet->>node: recalculate and update OOMScoreAdj <br> and Swap limit of containers
+        kubelet->>node: recalculate and update Swap limit of containers
         kubelet->>node: re-initialize resource managers
         kubelet->>node: node status update with new capacity
     else if decrease in resource
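The increase/decrease branch in the sequence diagram above can be sketched as a small handler. Every name here (`Node`, `handle_capacity_update`, the event strings) is a hypothetical editorial stand-in; the real kubelet code is Go and structured differently:

```python
# Editorial sketch of the hotplug decision flow from the sequence diagram.
# All names are hypothetical illustrations, not actual kubelet APIs.
from dataclasses import dataclass, field

@dataclass
class Node:
    capacity: int          # capacity reported by cAdvisor's machine info
    last_known: int        # capacity recorded at the previous sync
    ready: bool = True
    events: list = field(default_factory=list)

def handle_capacity_update(node: Node, new_capacity: int) -> None:
    if new_capacity > node.last_known:
        # Increase: recalculate swap limits, re-init managers, update status.
        node.events += ["recalculate swap limits",
                        "re-initialize resource managers",
                        "update node status"]
        node.capacity = node.last_known = new_capacity
        node.ready = True
    elif new_capacity < node.last_known:
        # Decrease: mark the node not ready until capacity is restored.
        node.capacity = new_capacity
        node.ready = False

n = Node(capacity=8, last_known=8)
handle_capacity_update(n, 16)   # hot-plug: node stays ready
handle_capacity_update(n, 8)    # hot-unplug: node goes not-ready
print(n.ready)  # False
```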
@@ -246,29 +251,24 @@ sequenceDiagram
 The interaction sequence is as follows:
 1. Kubelet will fetch machine resource information from cAdvisor's cache, which is configurable via a cAdvisor flag, `update_machine_info_interval`.
 2. If the machine resource is increased:
-   * Recalculate and update the OOMScoreAdj and swap limit of all running containers.
+   * Recalculate and update the swap limit of all running containers.
    * Re-initialize resource managers.
    * Update the node with the new resources.
 3. If the machine resource is decreased:
    * Set node status to not ready. (This will be reverted when the current capacity exceeds or matches either the previous hot-plug capacity, or the initial capacity
      in case there was no history of hotplug.)
 
 With an increase in cluster resources, the following components will be updated:
-1. Change in OOM score adjust:
-   * Currently, the OOM score adjust is calculated by
-     `1000 - (1000*containerMemReq)/memoryCapacity`
-   * Increase in memoryCapacity will result in an updated OOM score adjust for pods deployed post resize and also recalculate the same for existing pods.
+1. Change in Swap Memory limit:
+   * Currently, the swap memory limit is calculated by
+     `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
+   * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in an updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods.
 
-2. Change in Swap Memory limit:
-   * Currently, the swap memory limit is calculated by
-     `(<containerMemoryRequest>/<nodeTotalMemory>)*<totalPodsSwapAvailable>`
-   * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in an updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods.
+2. Resource managers are re-initialised.
 
-3. Resource managers are re-initialised.
+3. Update in Node capacity.
 
-4. Update in Node capacity.
-
-5. Scheduler:
+4. Scheduler:
    * Scheduler will automatically schedule any pending pods.
   * This is done as expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the
     available capacity of the node and creates pods accordingly.
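The swap-limit recalculation in step 1 above is plain proportional arithmetic. A quick sketch, computing the KEP's formula in bytes; the 2Gi/8Gi/4Gi sizes are assumed examples:

```python
# Sketch: the swap limit formula quoted above,
#   (<containerMemoryRequest>/<nodeTotalMemory>) * <totalPodsSwapAvailable>
# Sizes (2Gi request, 8Gi -> 16Gi memory, 4Gi swap) are assumptions.

GI = 1024 ** 3

def swap_limit(container_mem_request, node_total_memory, total_pods_swap):
    # Proportional share of the node's swap, computed in bytes.
    return (container_mem_request * total_pods_swap) // node_total_memory

before = swap_limit(2 * GI, 8 * GI, 4 * GI)    # before memory hotplug
after = swap_limit(2 * GI, 16 * GI, 4 * GI)    # after 8Gi -> 16Gi hotplug
print(before // GI, after / GI)  # 1 0.5
```

Hot-plugging memory shrinks each container's proportional swap share, which is why existing containers must be revisited.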
@@ -287,6 +287,7 @@ observed in any of the steps, operation is retried from step 1 along with a `Fai
    a. Resource managers such as CPU, Memory are updated with the latest available capacities on the host. This posts the latest
       available capacities under the node.
    b. This is achieved by calling ResyncComponents() of the ContainerManager interface to re-sync the resource managers.
+
 3. Updating the node allocatable resources:
    a. As the scheduler keeps a tab on the available resources of the node, post updating the available capacities,
       the scheduler proceeds to schedule any pending pods.
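The retry behavior referenced in the hunk header above (on failure, the operation is retried from step 1 and a warning event is emitted) can be sketched generically. Every name below is a hypothetical stand-in, not a kubelet API, and the event text is deliberately generic since the KEP's event name is truncated in this view:

```python
# Generic sketch: retry a multi-step resync from step 1 on any failure,
# emitting a warning event per failed attempt. Hypothetical names throughout.

def resync_with_retry(steps, emit_event, max_attempts=3):
    for _attempt in range(max_attempts):
        try:
            for step in steps:
                step()          # e.g. resync managers, update allocatable
            return True
        except Exception as err:
            emit_event(f"warning: resync failed: {err}")  # then retry from step 1
    return False

events = []
state = {"calls": 0}

def flaky_step():
    # Fails once, then succeeds, to exercise the retry path.
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("transient")

ok = resync_with_retry([flaky_step], events.append)
print(ok, len(events))  # True 1
```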
@@ -317,9 +318,7 @@ T=1: Resize Instance to Hotplug Memory:
     - <cgroup_path>/memory.swap.max: 1G
 ```
 
-Similar flow is applicable for updating oom_score_adj.
-
-#### Compatability with Cluster Autoscaler
+#### Compatibility with Cluster Autoscaler
 
 The Cluster Autoscaler (CA) presently anticipates uniform allocatable values among nodes within the same NodeGroup, using existing Nodes as templates for
 newly provisioned Nodes from the same NodeGroup. However, with the introduction of NodeResourceHotplug, this assumption may no longer hold true.
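For illustration, the `memory.swap.max` update shown at the top of the hunk above is, at the cgroup v2 level, a single file write. A minimal sketch against a temporary directory standing in for the cgroup; real kubelet cgroup paths depend on the cgroup driver and pod/container IDs, and `write_swap_max` is a hypothetical helper:

```python
# Sketch: applying a recalculated swap limit by writing cgroup v2's
# memory.swap.max. The path is faked with a temp dir so the demo is runnable.
import os
import tempfile

def write_swap_max(cgroup_path: str, limit_bytes: int) -> None:
    # cgroup v2 accepts a byte count (or "max") in memory.swap.max.
    with open(os.path.join(cgroup_path, "memory.swap.max"), "w") as f:
        f.write(str(limit_bytes))

with tempfile.TemporaryDirectory() as fake_cgroup:
    write_swap_max(fake_cgroup, 1 * 1024 ** 3)   # 1G, as in the example above
    with open(os.path.join(fake_cgroup, "memory.swap.max")) as f:
        print(f.read())  # 1073741824
```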
