-
Notifications
You must be signed in to change notification settings - Fork 434
Allow nvcdi feature flags to be set for the jit-cdi mode
#1419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/cherry-pick release-1.18 |
b025a99 to
4ea22c5
Compare
pkg/nvcdi/api.go
Outdated
| // FeatureDisableNvsandboxUtils disables the use of nvsandboxutils when | ||
| // querying devices. | ||
| // | ||
| // Deprecated: nvsandboxutils is now disabled by default. To opt-in use the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a policy on how long to keep a deprecated feature before total removal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we don't.
For this feature specifically, I know that @jgehrcke mentioned using this while working on the dynamic MIG for DRA.
|
Just a general question -- what is nvsandboxutils used for currently? As in, what parts of the CDI spec does it help us generate? |
It's functionality depends on driver version. Currently we use it here for device node symlinks nvidia-container-toolkit/internal/platform-support/dgpu/nvsandboxutils.go Lines 66 to 107 in 754cfa7
#825 aimed to refactor the driver discovery so that we could add additional functionality, but it has not made it in. @cdesiniotis thinking about this again, do we want to have opt-in or opt-out behaviour here. The issues that users are experiencing were limited to the 565 driver branch, and we may not want to roll back things now and lose the experience that we would gain by having this on by default. |
This change allows users the posibility to explicitly specify feature flags for using the `jit-cdi` mode. This allows, for example, for users to opt-in to use nvsandboxutils in the default mode in addition to when generating CDI specs explicitly. Signed-off-by: Evan Lezar <elezar@nvidia.com>
This original implementation of the FeatureDisableNvsandboxUtils feature flag included a dash that was not supposed to be there. This change updates the feature flag's string representation to disable-nvsandboxutils. Special handling is included for users that may still use the old string value (e.g. for the nvidia-ctk cdi generate command), but no changes are expected for users of the nvcdi API. Signed-off-by: Evan Lezar <elezar@nvidia.com>
4ea22c5 to
a37bab4
Compare
jit-cdi mode
|
The same question crossed my mind when initially reviewing this. Do we know what driver branches / versions besides 565 are affected? 565 is a short-lived branch and is no longer supported, see https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html. If the fix is present in all of the active branches (535 / 570 / 580), then I would definitely be in favor of continuing to use |
If I recall correctly, |
|
🤖 Backport PR created for |
This changs allows nvcdi feature flags to be set for the (now default)
jit-cdimode. This can be used as a workaround for issues such as #1398 where there are issues with a specific driver version.With the change from NVIDIA/k8s-device-plugin#1495 it will also be possible to set this in the device plugin.
To opt-in to this feature for the
jit-cdimode, run:To opt-in for
nvidia-ctk cdi generaterun:To opt-in for the
nvidia-cdi-refresh.serviceadd:to
/etc/nvidia-container-toolkit/nvidia-cdi-refresh.env.