KEP-5578: Node Resource Hot-Unplug #5585
Karthik-K-N
commented
Sep 30, 2025
- One-line PR description: Node Resource Hot-Unplug
- Issue link: Node Resource Hot-Unplug #5578
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Karthik-K-N
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This KEP is a follow-up to Node Resource Hot Plug.
/cc
e50a22f to b9859be

| ## Alternatives
|
| Scale down the cluster by removing compute nodes.
Maybe there's also an option to uninstall the kubelet, delete the associated Node, and re-register?
Not a good idea, but it is an alternative.

| ### Troubleshooting
|
| ###### How does this feature react if the API server and/or etcd is unavailable?
Hot unplug will trigger writes to Node .status, I imagine.

| Monitor the metrics
|
| - node_hot_unplug_request_total
Do these have dimensions / labels (eg to distinguish CPU unplugs from memory)?
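To make the question concrete, here is a minimal sketch of what a per-resource dimension could record. The `resource` label name and the `labeledCounter` type are hypothetical; the KEP does not define labels, and this uses a plain map rather than the kubelet's metrics library.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// labeledCounter is a stand-in for a counter with one "resource" label.
// It is NOT the kubelet's metrics library; the label name is an assumption
// made only to illustrate a per-resource dimension.
type labeledCounter struct {
	mu     sync.Mutex
	name   string
	counts map[string]int
}

func newLabeledCounter(name string) *labeledCounter {
	return &labeledCounter{name: name, counts: make(map[string]int)}
}

// Inc adds one to the series identified by the given resource label value.
func (c *labeledCounter) Inc(resource string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[resource]++
}

// Get returns the current value for one resource label value.
func (c *labeledCounter) Get(resource string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[resource]
}

// Render returns one line per series in a Prometheus-like text form,
// sorted by label value so the output order is stable.
func (c *labeledCounter) Render() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, fmt.Sprintf("%s{resource=%q} %d", c.name, k, c.counts[k]))
	}
	return lines
}

func main() {
	requests := newLabeledCounter("node_hot_unplug_request_total")
	requests.Inc("cpu")
	requests.Inc("cpu")
	requests.Inc("memory")
	for _, line := range requests.Render() {
		fmt.Println(line)
	}
}
```

With a label like this, an operator could tell CPU unplugs apart from memory unplugs without needing a separate metric per resource.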

| # List the feature gate name and the components for which it must be enabled
| feature-gates:
| # none, this is in the e2e test framework
| - name: NodeResourceHotUnPlug
nit: I'd use a lowercase p.

| ### Non-Goals
|
| * Dynamically adjust system reserved and kube reserved values.
| * Update the autoscaler to utilize resource hotplug.
nit: we have more than one autoscaler (actually about seven)

| Downgrade
|
| It's always possible to trivially downgrade to the previous kubelet.
If I have hot unplugged half my CPUs and then downgrade the kubelet, what happens? If it's all fine, why is it all fine?

| ### Version Skew Strategy
|
| Not relevant, As this kubelet specific feature and does not impact other components.
Why doesn't a hot unplug result in a write to the Node? I think people will expect that the .status reflects the Node's resource availability.

| ###### How can an operator determine if the feature is in use by workloads?
|
| This feature will be built into kubelet and behind a feature gate. Examining the kubelet feature gate would help in determining whether the feature is used. The enablement of the kubelet feature gate can be determined from the kubernetes_feature_enabled metric.
We can mention flagz.
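For the metric-based check described in the KEP text, a rough sketch of reading the gate state from scraped kubelet metrics might look like the following. The sample line, including the `stage` label and its value, is an assumption for illustration, and `featureEnabled` is a hypothetical helper, not an existing API.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// featureEnabled scans Prometheus text-format metrics for a
// kubernetes_feature_enabled sample whose name label matches gate.
// enabled is true when that sample's value is 1; found is false when
// no sample for the gate is present at all.
func featureEnabled(metrics, gate string) (enabled, found bool) {
	want := fmt.Sprintf("name=%q", gate)
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "kubernetes_feature_enabled{") ||
			!strings.Contains(line, want) {
			continue
		}
		// The sample value is the last whitespace-separated field.
		fields := strings.Fields(line)
		return fields[len(fields)-1] == "1", true
	}
	return false, false
}

func main() {
	// Hypothetical scrape output; the label set is assumed, not taken from the KEP.
	sample := `# TYPE kubernetes_feature_enabled gauge
kubernetes_feature_enabled{name="NodeResourceHotUnPlug",stage="ALPHA"} 1
kubernetes_feature_enabled{name="SomeOtherGate",stage="BETA"} 0
`
	enabled, found := featureEnabled(sample, "NodeResourceHotUnPlug")
	fmt.Println("found:", found, "enabled:", enabled)
}
```

This only shows that the gate is switched on for a given kubelet. It does not show whether any unplug has actually happened, which is where the request metrics would come in.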

| ###### Will enabling / using this feature result in any new API calls?
|
| No, It won't add/modify any user facing APIs. Internally kubelet runs the pod-readmission.
I think it should result in writes to the status subresource of the associated Node.
Without doing that, the scheduler(s) will make incorrect placement decisions.
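A simplified sketch of why a stale status matters: the fit check below is a toy stand-in for scheduler logic, not the real scheduler code, and every type, function, and number in it is hypothetical.

```go
package main

import "fmt"

// nodeStatus holds only the field this sketch needs:
// the allocatable CPU that the Node's status advertises, in millicores.
type nodeStatus struct {
	AllocatableMilliCPU int64
}

// fits reports whether a pod requesting requestMilliCPU fits on a node,
// judged purely from the advertised status and the CPU already requested.
func fits(status nodeStatus, alreadyRequested, requestMilliCPU int64) bool {
	return alreadyRequested+requestMilliCPU <= status.AllocatableMilliCPU
}

func main() {
	// Assumed scenario: the node had 8 CPUs and 4 were hot-unplugged.
	stale := nodeStatus{AllocatableMilliCPU: 8000} // status never updated after the unplug
	fresh := nodeStatus{AllocatableMilliCPU: 4000} // status written after the unplug

	const running, incoming = 3000, 2000

	fmt.Println("stale status admits pod:", fits(stale, running, incoming))
	fmt.Println("fresh status admits pod:", fits(fresh, running, incoming))
}
```

With the stale status the placement looks valid even though the node can no longer satisfy it; with the updated status the same pod is correctly rejected.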

| ## Motivation
|
| Node Resource Hotplug provides the ability to increase the resources of a cluster on demand without any downtime during a surge of resource usages by workloads
| Now with Node resource hot-unplug the motive is to remove the resources when not needed for cost optimisation without any downtime.
Is this purely about cost optimization?
I think this could be used to respond to hardware degradation too, or even as a way for a bare-metal node to prepare for an in-place hardware upgrade.