@elezar
Member

@elezar elezar commented Nov 4, 2025

This change allows nvcdi feature flags to be set for the (now default) jit-cdi mode. This can be used as a workaround for issues such as #1398, which only occur with a specific driver version.

With the change from NVIDIA/k8s-device-plugin#1495 it will also be possible to set this in the device plugin.

To opt in to this feature for the jit-cdi mode, run:

sudo nvidia-ctk config --in-place --set nvidia-container-runtime.modes.jit-cdi.nvcdi-feature-flags=enable-nvsandboxutils

To opt in for nvidia-ctk cdi generate, run:

nvidia-ctk cdi generate --feature-flag=enable-nvsandboxutils

To opt in for the nvidia-cdi-refresh.service, add:

NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS=enable-nvsandboxutils

to /etc/nvidia-container-toolkit/nvidia-cdi-refresh.env.

@elezar elezar added this to the v1.18.1 milestone Nov 4, 2025
@elezar
Member Author

elezar commented Nov 4, 2025

/cherry-pick release-1.18

@elezar elezar self-assigned this Nov 4, 2025
@elezar elezar force-pushed the make-nvsandbox-utils-opt-in branch from b025a99 to 4ea22c5 Compare November 4, 2025 16:18
@elezar elezar changed the title Make nvsandbox utils opt in Disable nvsandbox utils by default Nov 4, 2025
pkg/nvcdi/api.go Outdated
// FeatureDisableNvsandboxUtils disables the use of nvsandboxutils when
// querying devices.
//
// Deprecated: nvsandboxutils is now disabled by default. To opt-in use the
Collaborator

Do we have a policy on how long to keep a deprecated feature before total removal?

Member Author

No, we don't.

For this feature specifically, I know that @jgehrcke mentioned using this while working on the dynamic MIG for DRA.

@elezar elezar changed the title Disable nvsandbox utils by default Disable nvsandboxutils by default Nov 5, 2025
@cdesiniotis
Contributor

Just a general question -- what is nvsandboxutils used for currently? As in, what parts of the CDI spec does it help us generate?

@elezar
Member Author

elezar commented Nov 7, 2025

Just a general question -- what is nvsandboxutils used for currently? As in, what parts of the CDI spec does it help us generate?

Its functionality depends on the driver version. Currently we use it here for device node symlinks:

func (d *nvsandboxutilsDGPU) Devices() ([]discover.Device, error) {
	gpuFileInfos, ret := d.lib.GetGpuResource(d.uuid)
	if ret != nvsandboxutils.SUCCESS {
		return nil, fmt.Errorf("failed to get GPU resource: %w", ret)
	}
	var devices []discover.Device
	for _, info := range gpuFileInfos {
		switch info.SubType {
		case nvsandboxutils.NV_DEV_DRI_CARD, nvsandboxutils.NV_DEV_DRI_RENDERD:
			if d.isMig {
				continue
			}
			fallthrough
		case nvsandboxutils.NV_DEV_NVIDIA, nvsandboxutils.NV_DEV_NVIDIA_CAPS_NVIDIA_CAP:
			containerPath := info.Path
			if d.devRoot != "/" {
				containerPath = strings.TrimPrefix(containerPath, d.devRoot)
			}
			// TODO: Extend discover.Device with additional information.
			device := discover.Device{
				HostPath: info.Path,
				Path:     containerPath,
			}
			devices = append(devices, device)
		case nvsandboxutils.NV_DEV_DRI_CARD_SYMLINK, nvsandboxutils.NV_DEV_DRI_RENDERD_SYMLINK:
			if d.isMig {
				continue
			}
			if info.Flags == nvsandboxutils.NV_FILE_FLAG_CONTENT {
				targetPath, ret := d.lib.GetFileContent(info.Path)
				if ret != nvsandboxutils.SUCCESS {
					return nil, fmt.Errorf("failed to get symlink: %w", ret)
				}
				d.deviceLinks = append(d.deviceLinks, fmt.Sprintf("%v::%v", targetPath, info.Path))
			}
		}
	}
	return devices, nil
}
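
For readers who want to see the two path manipulations from the function above in isolation, here is a minimal standalone sketch. It does not use the real discover or nvsandboxutils packages: `containerPathFor` and `linkSpec` are hypothetical helper names, and the example paths are made up for illustration. It only mirrors the `strings.TrimPrefix` step and the `target::link` formatting.

```go
package main

import (
	"fmt"
	"strings"
)

// containerPathFor mirrors the devRoot handling shown above: when the dev
// root is not "/", that prefix is stripped from the host path to obtain the
// path used inside the container. Hypothetical helper, not toolkit API.
func containerPathFor(hostPath, devRoot string) string {
	if devRoot != "/" {
		return strings.TrimPrefix(hostPath, devRoot)
	}
	return hostPath
}

// linkSpec mirrors the "%v::%v" formatting used for d.deviceLinks above.
// Hypothetical helper, not toolkit API.
func linkSpec(targetPath, linkPath string) string {
	return fmt.Sprintf("%v::%v", targetPath, linkPath)
}

func main() {
	// Illustrative paths only; they are assumptions, not taken from a real system.
	fmt.Println(containerPathFor("/run/nvidia/driver/dev/nvidia0", "/run/nvidia/driver"))
	fmt.Println(containerPathFor("/dev/nvidia0", "/"))
	fmt.Println(linkSpec("../card0", "/dev/dri/by-path/pci-0000:01:00.0-card"))
}
```

With a dev root of "/" the host path and container path are identical, which is why the original code only trims in the non-default case.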

#825 aimed to refactor the driver discovery so that we could add additional functionality, but it has not made it in.

@cdesiniotis thinking about this again, do we want opt-in or opt-out behaviour here? The issues that users are experiencing were limited to the 565 driver branch, and we may not want to roll things back now and lose the experience that we would gain by having this on by default.

This change allows users to explicitly specify feature flags
when using the `jit-cdi` mode. This allows users, for example,
to opt in to using nvsandboxutils in the default mode in
addition to when generating CDI specs explicitly.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
The original implementation of the FeatureDisableNvsandboxUtils feature
flag included a dash that was not supposed to be there. This change
updates the feature flag's string representation to disable-nvsandboxutils.

Special handling is included for users that may still use the old string
value (e.g. for the nvidia-ctk cdi generate command), but no changes are
expected for users of the nvcdi API.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
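
The special handling described in the commit message above could look roughly like the following sketch. This is not the code from the PR: `normalizeFeatureFlag` is a hypothetical function, and the legacy spelling `disable-nvsandbox-utils` is an assumption inferred from the remark about the extra dash. Only `disable-nvsandboxutils` and `enable-nvsandboxutils` appear in this conversation.

```go
package main

import "fmt"

const (
	// canonicalDisableFlag is the string representation named in the commit message.
	canonicalDisableFlag = "disable-nvsandboxutils"
	// legacyDisableFlag is an ASSUMED spelling of the old value with the extra dash.
	legacyDisableFlag = "disable-nvsandbox-utils"
)

// normalizeFeatureFlag maps the assumed legacy spelling onto the canonical
// one and returns every other flag name unchanged. Hypothetical helper.
func normalizeFeatureFlag(name string) string {
	if name == legacyDisableFlag {
		return canonicalDisableFlag
	}
	return name
}

func main() {
	inputs := []string{
		"disable-nvsandbox-utils", // assumed legacy spelling
		"disable-nvsandboxutils",
		"enable-nvsandboxutils",
	}
	for _, in := range inputs {
		fmt.Printf("%s -> %s\n", in, normalizeFeatureFlag(in))
	}
}
```

Normalizing at the point where command-line or environment input is parsed would keep the rest of the code working with a single spelling, which is consistent with the statement that users of the nvcdi API should see no change.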
@elezar elezar force-pushed the make-nvsandbox-utils-opt-in branch from 4ea22c5 to a37bab4 Compare November 7, 2025 09:23
@elezar elezar changed the title Disable nvsandboxutils by default Allow nvcdi feature flags to be set for the jit-cdi mode Nov 7, 2025
@cdesiniotis
Contributor

The same question crossed my mind when initially reviewing this. Do we know what driver branches / versions besides 565 are affected? 565 is a short-lived branch and is no longer supported, see https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html. If the fix is present in all of the active branches (535 / 570 / 580), then I would definitely be in favor of continuing to use nvsandboxutils by default and providing opt-out behavior.

@elezar
Member Author

elezar commented Nov 10, 2025

If the fix is present in all of the active branches (535 / 570 / 580), then I would definitely be in favor of continuing to use nvsandboxutils by default and providing opt-out behavior.

If I recall correctly, nvsandboxutils was only added in a driver version after 535, meaning that this bug is not applicable there. The bug that was triggered in 565 was already fixed in 570 at the point where we discovered it. The 580 branch also includes this fix.
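
As a rough illustration of the branch reasoning in this thread, the sketch below extracts the branch number from a driver version string and flags the 565 branch. It is only a sketch under the assumptions stated in the comments above (565 affected, 570 and 580 fixed, 535 not applicable, all from recollection). `driverBranch` and `affectedBranch` are hypothetical names, and the version strings are illustrative.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// driverBranch returns the leading numeric component of a driver version
// string such as "565.57.01". Hypothetical helper.
func driverBranch(version string) (int, error) {
	major, _, _ := strings.Cut(version, ".")
	return strconv.Atoi(major)
}

// affectedBranch reports whether the branch is 565, the only branch named as
// affected in this conversation. This encodes the discussion, not a
// verified compatibility matrix.
func affectedBranch(branch int) bool {
	return branch == 565
}

func main() {
	// Illustrative version strings only.
	for _, v := range []string{"535.183.01", "565.57.01", "570.86.15", "580.65.06"} {
		branch, err := driverBranch(v)
		if err != nil {
			fmt.Printf("%s: cannot parse branch: %v\n", v, err)
			continue
		}
		fmt.Printf("%s: branch %d, affected=%t\n", v, branch, affectedBranch(branch))
	}
}
```

A check of this kind would only matter for deciding between opt-in and opt-out defaults. The flags described at the top of the PR remain the way to override the behaviour on a specific system.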

@elezar elezar merged commit c171e65 into NVIDIA:main Nov 10, 2025
13 checks passed
@github-actions

🤖 Backport PR created for release-1.18: #1443
