Allow nvcdi feature flags to be set for the `jit-cdi` mode #1419

elezar · 2025-11-04T16:04:06Z

This change allows nvcdi feature flags to be set for the (now default) jit-cdi mode. This can be used as a workaround for issues such as #1398, where problems occur with a specific driver version.

With the change from NVIDIA/k8s-device-plugin#1495 it will also be possible to set this in the device plugin.

To opt in to this feature for the jit-cdi mode, run:

sudo nvidia-ctk config --in-place --set nvidia-container-runtime.modes.jit-cdi.nvcdi-feature-flags=enable-nvsandboxutils

To opt in for nvidia-ctk cdi generate, run:

nvidia-ctk cdi generate --feature-flag=enable-nvsandboxutils

To opt in for the nvidia-cdi-refresh.service, add:

NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS=enable-nvsandboxutils

to /etc/nvidia-container-toolkit/nvidia-cdi-refresh.env.
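For anyone scripting the last step, here is a minimal sketch of setting the variable idempotently. It assumes the env file is a plain KEY=VALUE file; `upsertEnvLine` is a hypothetical helper and not part of the toolkit. It only transforms file content held in a string, so reading and writing the file is left to the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// upsertEnvLine returns content with key set to value. An existing
// "key=..." line is replaced in place; otherwise a new line is appended.
// Hypothetical helper for illustration only; it assumes a simple
// KEY=VALUE file format.
func upsertEnvLine(content, key, value string) string {
	entry := key + "=" + value
	if content == "" {
		return entry + "\n"
	}
	lines := strings.Split(strings.TrimRight(content, "\n"), "\n")
	replaced := false
	for i, line := range lines {
		if strings.HasPrefix(strings.TrimSpace(line), key+"=") {
			lines[i] = entry
			replaced = true
		}
	}
	if !replaced {
		lines = append(lines, entry)
	}
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	existing := "# nvidia-cdi-refresh settings\n"
	updated := upsertEnvLine(existing,
		"NVIDIA_CTK_CDI_GENERATE_FEATURE_FLAGS", "enable-nvsandboxutils")
	fmt.Print(updated)
}
```

Applying the helper twice yields the same content as applying it once, so it can be rerun safely from provisioning scripts.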

elezar · 2025-11-04T16:04:29Z

/cherry-pick release-1.18

ArangoGutierrez · 2025-11-04T16:42:45Z

pkg/nvcdi/api.go

 	// FeatureDisableNvsandboxUtils disables the use of nvsandboxutils when
 	// querying devices.
+	//
+	// Deprecated: nvsandboxutils is now disabled by default. To opt-in use the


Do we have a policy on how long to keep a deprecated feature before total removal?

No, we don't.

For this feature specifically, I know that @jgehrcke mentioned using this while working on the dynamic MIG for DRA.

cdesiniotis · 2025-11-06T22:12:52Z

Just a general question -- what is nvsandboxutils used for currently? As in, what parts of the CDI spec does it help us generate?

elezar · 2025-11-07T09:10:20Z

Just a general question -- what is nvsandboxutils used for currently? As in, what parts of the CDI spec does it help us generate?

Its functionality depends on the driver version. Currently we use it here for device node symlinks:

nvidia-container-toolkit/internal/platform-support/dgpu/nvsandboxutils.go

Lines 66 to 107 in 754cfa7

    func (d *nvsandboxutilsDGPU) Devices() ([]discover.Device, error) {
    	gpuFileInfos, ret := d.lib.GetGpuResource(d.uuid)
    	if ret != nvsandboxutils.SUCCESS {
    		return nil, fmt.Errorf("failed to get GPU resource: %w", ret)
    	}

    	var devices []discover.Device
    	for _, info := range gpuFileInfos {
    		switch info.SubType {
    		case nvsandboxutils.NV_DEV_DRI_CARD, nvsandboxutils.NV_DEV_DRI_RENDERD:
    			if d.isMig {
    				continue
    			}
    			fallthrough
    		case nvsandboxutils.NV_DEV_NVIDIA, nvsandboxutils.NV_DEV_NVIDIA_CAPS_NVIDIA_CAP:
    			containerPath := info.Path
    			if d.devRoot != "/" {
    				containerPath = strings.TrimPrefix(containerPath, d.devRoot)
    			}
    			// TODO: Extend discover.Device with additional information.
    			device := discover.Device{
    				HostPath: info.Path,
    				Path:     containerPath,
    			}
    			devices = append(devices, device)
    		case nvsandboxutils.NV_DEV_DRI_CARD_SYMLINK, nvsandboxutils.NV_DEV_DRI_RENDERD_SYMLINK:
    			if d.isMig {
    				continue
    			}
    			if info.Flags == nvsandboxutils.NV_FILE_FLAG_CONTENT {
    				targetPath, ret := d.lib.GetFileContent(info.Path)
    				if ret != nvsandboxutils.SUCCESS {
    					return nil, fmt.Errorf("failed to get symlink: %w", ret)
    				}
    				d.deviceLinks = append(d.deviceLinks, fmt.Sprintf("%v::%v", targetPath, info.Path))
    			}
    		}
    	}

    	return devices, nil
    }

#825 aimed to refactor the driver discovery so that we could add additional functionality, but it has not made it in.

@cdesiniotis thinking about this again, do we want opt-in or opt-out behaviour here? The issues that users are experiencing were limited to the 565 driver branch, and we may not want to roll things back now and lose the experience that we would gain by having this on by default.

This change allows users to explicitly specify feature flags when using the `jit-cdi` mode. For example, users can opt in to nvsandboxutils in the default mode in addition to when generating CDI specs explicitly.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

The original implementation of the FeatureDisableNvsandboxUtils feature flag included a dash that was not supposed to be there. This change updates the feature flag's string representation to disable-nvsandboxutils. Special handling is included for users who may still use the old string value (e.g. for the nvidia-ctk cdi generate command), but no changes are expected for users of the nvcdi API.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
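The "special handling" for the old string value could look roughly like the sketch below. This is an illustration, not the code from this PR: `FeatureFlag` and `FeatureDisableNvsandboxUtils` are local stand-ins for the nvcdi names, `normalizeFeatureFlag` is a hypothetical helper, and the legacy spelling `disable-nvsandbox-utils` is an assumption, since the commit message only says the old value contained an extra dash.

```go
package main

import "fmt"

// FeatureFlag is a local stand-in for the nvcdi feature flag type.
type FeatureFlag string

const (
	// Canonical string representation, as stated in the commit message.
	FeatureDisableNvsandboxUtils FeatureFlag = "disable-nvsandboxutils"

	// ASSUMPTION: the legacy spelling. The commit message only says that
	// the old value contained a dash that was not supposed to be there.
	legacyDisableNvsandboxUtils = "disable-nvsandbox-utils"
)

// normalizeFeatureFlag maps the legacy spelling onto the canonical flag
// and passes every other value through unchanged. Hypothetical helper.
func normalizeFeatureFlag(s string) FeatureFlag {
	if s == legacyDisableNvsandboxUtils {
		return FeatureDisableNvsandboxUtils
	}
	return FeatureFlag(s)
}

func main() {
	for _, in := range []string{
		"disable-nvsandbox-utils",
		"disable-nvsandboxutils",
		"enable-nvsandboxutils",
	} {
		fmt.Printf("%s -> %s\n", in, normalizeFeatureFlag(in))
	}
}
```

Normalizing at the point where user-supplied strings are parsed keeps the rest of the code comparing against a single constant, which is consistent with the statement that nvcdi API users see no change.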

cdesiniotis · 2025-11-07T17:26:02Z

The same question crossed my mind when initially reviewing this. Do we know what driver branches / versions besides 565 are affected? 565 is a short-lived branch and is no longer supported, see https://docs.nvidia.com/datacenter/tesla/drivers/supported-drivers-and-cuda-toolkit-versions.html. If the fix is present in all of the active branches (535 / 570 / 580), then I would definitely be in favor of continuing to use nvsandboxutils by default and providing opt-out behavior.

elezar · 2025-11-10T14:22:40Z

If the fix is present in all of the active branches (535 / 570 / 580), then I would definitely be in favor of continuing to use nvsandboxutils by default and providing opt-out behavior.

If I recall correctly, nvsandboxutils was only added in a driver version after 535, meaning that this bug is not applicable there. The bug that was triggered in 565 had already been fixed in 570 by the time we discovered it. The 580 branch also includes this fix.
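As a rough illustration of the branch reasoning in this thread, the sketch below extracts the driver branch from a version string and flags only the 565 branch. Both helpers are hypothetical, and the `branch == 565` check encodes nothing more than the recollection above (bug seen in 565, fixed in 570 and 580, not applicable to 535); it is not an authoritative list of affected drivers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// driverBranch returns the leading numeric component of a driver version
// string, e.g. 565 for "565.57.01". Hypothetical helper.
func driverBranch(version string) (int, error) {
	major, _, _ := strings.Cut(strings.TrimSpace(version), ".")
	branch, err := strconv.Atoi(major)
	if err != nil {
		return 0, fmt.Errorf("invalid driver version %q: %w", version, err)
	}
	return branch, nil
}

// reportedAffected reflects only what this thread states: the issue was
// seen on the 565 branch. ASSUMPTION: no other branch is affected.
func reportedAffected(branch int) bool {
	return branch == 565
}

func main() {
	// The version strings below are illustrative inputs.
	for _, v := range []string{"535.0.0", "565.0.0", "570.0.0", "580.0.0"} {
		branch, err := driverBranch(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: branch %d, reported affected: %t\n", v, branch, reportedAffected(branch))
	}
}
```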

github-actions · 2025-11-10T14:23:00Z

🤖 Backport PR created for release-1.18: #1443 ✅

elezar added this to the v1.18.1 milestone Nov 4, 2025

github-actions bot added the cherry-pick/release-1.18 label Nov 4, 2025

elezar requested review from ArangoGutierrez, cdesiniotis, jgehrcke and tariq1890 November 4, 2025 16:07

elezar self-assigned this Nov 4, 2025

elezar mentioned this pull request Nov 4, 2025

Update NVIDIA Container Toolkit to v1.18.1 NVIDIA/k8s-device-plugin#1485

Open

elezar force-pushed the make-nvsandbox-utils-opt-in branch from b025a99 to 4ea22c5 Compare November 4, 2025 16:18

elezar changed the title ~~Make nvsandbox utils opt in~~ Disable nvsandbox utils by default Nov 4, 2025

ArangoGutierrez reviewed Nov 4, 2025

ArangoGutierrez approved these changes Nov 5, 2025

elezar changed the title ~~Disable nvsandbox utils by default~~ Disable nvsandboxutils by default Nov 5, 2025

elezar added 2 commits November 7, 2025 10:16

elezar force-pushed the make-nvsandbox-utils-opt-in branch from 4ea22c5 to a37bab4 Compare November 7, 2025 09:23

elezar changed the title ~~Disable nvsandboxutils by default~~ Allow nvcdi feature flags to be set for the jit-cdi mode Nov 7, 2025

elezar requested a review from ArangoGutierrez November 7, 2025 09:25

elezar merged commit c171e65 into NVIDIA:main Nov 10, 2025
13 checks passed

github-actions bot mentioned this pull request Nov 10, 2025

[release-1.18] Allow nvcdi feature flags to be set for the jit-cdi mode #1443

Merged

3 participants
