@@ -390,9 +390,6 @@ WindowsPodSandboxConfig.
 configurations will see values that may not represent actual configurations. As a
 mitigation, this change needs to be documented and highlighted in the
 release notes, and in top-level Kubernetes documents.
-1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
-   could be in use, and approaches such as setting limit near current usage may
-   be required. This issue needs further investigation.
 1. Scheduler race condition: If a resize happens concurrently with the scheduler evaluating the node
    where the pod is resized, it can result in a node being over-scheduled, which will cause the pod
    to be rejected with an `OutOfCPU` or `OutOfMemory` error. Solving this race condition is out of
@@ -810,11 +807,17 @@ Setting the memory limit below current memory usage can cause problems. If the k
 sufficient memory, the outcome depends on the cgroups version. With cgroups v1 the change will
 simply be rejected by the kernel, whereas with cgroups v2 it will trigger an oom-kill.

-In the initial beta release of in-place resize, we will **disallow** `PreferNoRestart` memory limit
-decreases, enforced through API validation. The intent is for this restriction to be relaxed in the
-future, but the design of how limit decreases will be approached is still undecided.
+If the memory resize restart policy is `NotRequired` (or unspecified), the Kubelet will make a
+**best-effort** attempt to prevent oom-kills when decreasing memory limits, but doesn't provide any
+guarantees. Before decreasing container memory limits, the Kubelet will read the container memory
+usage (via the StatsProvider). If usage is greater than the desired limit, the resize will be
+skipped for that container. The pod condition `PodResizeInProgress` will remain, with an `Error`
+reason and a message reporting the current usage and desired limit. This is considered best-effort
+since it is still subject to a time-of-check-time-of-use (TOCTOU) race condition: usage can exceed
+the limit after the check is performed. A similar check will also be performed at the pod level
+before lowering the pod cgroup memory limit.

-Memory limit decreases with `RestartRequired` are still allowed.
+_Version skew note:_ Kubernetes v1.33 (and earlier) nodes only check the pod-level memory usage.

 ### Swap

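To make the best-effort check introduced in the hunk above concrete, here is a minimal Go sketch of
the decision made before lowering a container's memory limit. This is illustrative only, not the
actual kubelet implementation; the `containerStats` type and `canDecreaseMemoryLimit` function are
invented for the example.

```go
package main

import "fmt"

// containerStats stands in for the usage data a stats provider would return.
type containerStats struct {
	memoryUsageBytes uint64
}

// canDecreaseMemoryLimit reports whether a memory limit decrease should be
// attempted. It is best-effort: usage can still grow past the new limit
// between this check and the cgroup write (the TOCTOU race noted above).
func canDecreaseMemoryLimit(desiredLimitBytes uint64, stats containerStats) (bool, string) {
	if stats.memoryUsageBytes > desiredLimitBytes {
		return false, fmt.Sprintf(
			"memory usage (%d bytes) exceeds desired limit (%d bytes); skipping resize",
			stats.memoryUsageBytes, desiredLimitBytes)
	}
	return true, ""
}

func main() {
	stats := containerStats{memoryUsageBytes: 900 << 20} // ~900Mi currently in use

	if ok, reason := canDecreaseMemoryLimit(512<<20, stats); !ok {
		// In the flow described above, the PodResizeInProgress condition would
		// remain set with an Error reason and a message along these lines.
		fmt.Println(reason)
	}
}
```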
@@ -891,7 +894,8 @@ This will be reconsidered post-beta as a future enhancement.

 ### Future Enhancements

-1. Allow memory limits to be decreased, and handle the case where limits are set below usage.
+1. Improve memory limit decrease oom-kill prevention by leveraging other kernel mechanisms or using
+   gradual decrease.
 1. Kubelet (or Scheduler) evicts lower priority Pods from Node to make room for
    resize. Pre-emption by Kubelet may be simpler and offer lower latencies.
 1. Allow ResizePolicy to be set on Pod level, acting as default if (some of)
@@ -1546,6 +1550,8 @@ _This section must be completed when targeting beta graduation to a release._
     and update CRI `UpdateContainerResources` contract
   - Add back `AllocatedResources` field to resolve a scheduler corner case
   - Introduce Actuated resources for actuation
+- 2025-06-03 - v1.34 post-beta updates
+  - Allow no-restart memory limit decreases

 ## Drawbacks
