Add csinode limit awareness in cluster-autoscaler #8721
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: gnufied. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from c6d08e7 to c26409b.
@mtrqq Could you do a first-pass review here? This is supposed to follow the way the DRA integration was implemented in CA as much as possible, but instead of the DRA objects we have CSINode objects.
Force-pushed from 10b6319 to 18906a8.
Force-pushed from 48c0b28 to 2a287f3.
@mtrqq yes, the PR is ready for review. It was marked WIP because I had to keep rebasing it and make a bunch of changes due to the latest kube rebase. Even the tests that are still failing are most likely unrelated to this change and are happening because of the kube version bump.
Force-pushed from 2a287f3 to 2811f12.
Force-pushed from 2811f12 to e673a9b.
Force-pushed from e673a9b to dd1da1e.
```go
	return nil, err
}

wrappedNodeInfo := framework.WrapSchedulerNodeInfo(schedNodeInfo, nil, nil)
```
It took me some time to understand that you've changed the approach from wrapping the node info object inside the CSI/DRA snapshots to mutating the node info object in place. This somewhat aligns with the approach of reducing memory allocations when handling node infos; I like this part.
But can we come up with a consistent name for this method across snapshots? For example, AugmentNodeInfo.
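To make the naming suggestion concrete, a unified method could look roughly like the sketch below. AugmentNodeInfo is only the proposed name; the signature and the getCSINode lookup helper are assumptions for illustration, not code from this PR.

```go
// AugmentNodeInfo mutates the wrapped NodeInfo in place, attaching the
// snapshot-specific state (here the CSINode; for DRA it would be the
// ResourceSlices/ResourceClaims) instead of returning a new wrapper object.
// Hypothetical sketch of the suggested common method name.
func (s *Snapshot) AugmentNodeInfo(nodeInfo *framework.NodeInfo) error {
	csiNode, err := s.getCSINode(nodeInfo.Node().Name) // hypothetical lookup helper
	if err != nil {
		return err
	}
	nodeInfo.CSINode = csiNode
	return nil
}
```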
| "k8s.io/klog/v2" | ||
| fwk "k8s.io/kube-scheduler/framework" | ||
| schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" | ||
| intreeschedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" |
Why change the import name here? The previous name is consistent across the rest of the codebase, and I don't see how the new name is better.
So, they changed some of the interfaces in kube-scheduler and I renamed this import because it felt more consistent.
But please ignore these renamings for now; once I rebase my PR on #8827 they will be gone.
| "k8s.io/klog/v2" | ||
| fwk "k8s.io/kube-scheduler/framework" | ||
| schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" | ||
| intreeschedulerframework "k8s.io/kubernetes/pkg/scheduler/framework" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change the import name here? The previous name is consistent across the other parts of the codebase and I don't see how the new name is better
```go
}

// AddCSINodes adds a list of CSI nodes to the snapshot.
func (s *Snapshot) AddCSINodes(csiNodes []*storagev1.CSINode) error {
```
Can we reuse AddCSINode here?
Fixed.
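For context, the resolution presumably ends up looking something like this sketch; only the method names come from the diff, the body is an assumption:

```go
// AddCSINodes adds a list of CSI nodes to the snapshot by delegating to
// AddCSINode for each entry. Sketch only; the PR's final code may differ.
func (s *Snapshot) AddCSINodes(csiNodes []*storagev1.CSINode) error {
	for _, csiNode := range csiNodes {
		if err := s.AddCSINode(csiNode); err != nil {
			return err
		}
	}
	return nil
}
```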
cluster-autoscaler/cloudprovider/clusterapi/clusterapi_nodegroup_test.go
cluster-autoscaler/go.mod
Outdated
```diff
 replace github.com/rancher/go-rancher => github.com/rancher/go-rancher v0.1.0

-replace k8s.io/api => k8s.io/api v0.34.1
+replace k8s.io/api => github.com/kubernetes/api v0.0.0-20251107002836-f1737241c064
```
What's the reason behind adding github.com to all k8s dependencies?
```go
// if cloudprovider does not provide CSI related stuff, then we can skip the CSI readiness check
if nodeInfo.CSINode == nil {
	newReadyNodes = append(newReadyNodes, node)
	klog.Warningf("No CSI node found for node %s, Skipping CSI readiness check and keeping node in ready list.", node.Name)
```
Warning level seems extreme given the potential noise in the logs; or do we anticipate all nodes to have a matching CSINode?
Yes, all nodes should have a matching CSINode object. Without a CSINode object, kubelet will not even report the node as ready.
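For reference, the per-driver attach limit that this readiness check ultimately protects lives on the CSINode object. A minimal sketch of reading it (the helper name is illustrative, not code from this PR):

```go
import storagev1 "k8s.io/api/storage/v1"

// maxVolumesForDriver returns the attachable-volume limit a CSI driver reports
// on a node's CSINode object, or -1 if the driver does not set a limit.
func maxVolumesForDriver(csiNode *storagev1.CSINode, driverName string) int32 {
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == driverName && d.Allocatable != nil && d.Allocatable.Count != nil {
			return *d.Allocatable.Count
		}
	}
	return -1
}
```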
Force-pushed from dd1da1e to e9bbe15.
@gnufied: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Force-pushed from e9bbe15 to 9a64495.
Fix code to use new framework
Force-pushed from 7300575 to 64598e8.
```go
var options []expander.Option

// This code here runs a simulation to see which pods can be scheduled on which node groups.
// TODO: Fix bug with CSI node not being added to the simulation.
```
Can we create an issue for this and paste the issue link in here if we plan to address this as a follow-up item?
So, this was already resolved; it is yet another leftover artifact. The whole mechanism doesn't work if the CSINode is not in the snapshot.
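To spell out what "CSINode is in the snapshot" means for the scale-up simulation, here is a rough sketch of the intended flow; the lister and variable names are assumptions, only AddCSINodes comes from this PR:

```go
// CSINode objects must be in the cluster snapshot before the scale-up
// simulation runs; otherwise per-node volume limits are not enforced.
csiNodes, err := csiNodeLister.List(labels.Everything())
if err != nil {
	return err
}
if err := clusterSnapshot.AddCSINodes(csiNodes); err != nil {
	return err
}
// ...then simulate scheduling the pending pods against each node group's
// template node, which now carries its CSINode attach limits.
```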
Force-pushed from 64598e8 to 04586b9.
```diff
 if aErr != nil {
 	return status.UpdateScaleUpError(&status.ScaleUpStatus{}, aErr.AddPrefix("could not get upcoming nodes: "))
 }
-klog.V(4).Infof("Upcoming %d nodes", len(upcomingNodes))
```
It's possible there are users who derive value from this and are running at v=4; can we restore it?
xref - kubernetes/enhancements#5030
This makes CAS aware of volume limits on new nodes when scaling for pending pods.
I have tested the proposed changes and they work fine. I have tested and verified both the scale-from-zero and the scale-from-existing-nodes scenarios, and both work.
For example, on an AWS OpenShift cluster with these changes, I could observe that it spins up the correct number of machines, taking volumes into account. To test this, I created a pod that consumes 20 volumes. Usually 5 such pods can easily fit on a node, but because of the number of volumes we must spin up at least 5 worker nodes. In the scale-from-zero scenario, and given that I already had 3 workers, I could see the autoscaler created the right number of workers from the get-go.
I have tested this with a higher number of nodes too, so we know it works as expected. Before my change, CAS would create just 1 node to accommodate such pods, because it thinks all 5 pods will fit on one node.
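To make the expected math explicit, here is a small self-contained sketch of the calculation CAS effectively has to perform; the per-node limit of 25 is a hypothetical stand-in, the real number comes from the CSINode object reported by the driver:

```go
package main

import "fmt"

func main() {
	// Hypothetical numbers for illustration only; at runtime the per-node limit
	// is read from CSINode.Spec.Drivers[*].Allocatable.Count.
	const volumesPerPod = 20
	const attachableVolumesPerNode = 25
	const pendingPods = 5

	podsPerNode := attachableVolumesPerNode / volumesPerPod      // 1 pod per node once volumes are counted
	nodesNeeded := (pendingPods + podsPerNode - 1) / podsPerNode // ceil(5 / 1) = 5

	fmt.Printf("pods per node limited by volumes: %d\n", podsPerNode)
	fmt.Printf("nodes needed for %d pending pods: %d\n", pendingPods, nodesNeeded)
}
```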