
Conversation

ArangoGutierrez (Collaborator) commented Nov 18, 2025:

This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.
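
For readers following along, a rough sketch of the shape being proposed (illustrative only; the exact names in the PR may differ):

type HealthProvider interface {
	// Start launches the monitoring goroutine and returns once initialization has completed.
	Start(ctx context.Context) error
	// Stop cancels monitoring and waits, via the WaitGroup, for the goroutine to exit.
	Stop()
	// Health delivers devices whose health state has changed.
	Health() <-chan *Device
}

The plugin would then start the provider, consume Health() in its event loop, and call Stop() on shutdown.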

ArangoGutierrez self-assigned this Nov 18, 2025
ArangoGutierrez force-pushed the gtg branch 2 times, most recently from 541c6cb to 44d450f on November 18, 2025 at 18:13
ArangoGutierrez added the "feature" label (issue/PR that proposes a new feature or functionality) on Nov 18, 2025
ArangoGutierrez marked this pull request as ready for review on November 18, 2025 at 18:45
}()

// Start recovery worker to detect when unhealthy devices become healthy
go plugin.runRecoveryWorker()
Member:

Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?

Collaborator Author:

Sounds like a good idea, even more so based on your other comment #1508 (review).
I wanted a refactor, but that interface is a different conversation. Going to work on splitting this PR.

elezar (Member) left a comment:

In the context of the k8s-dra-driver-gpu we discussed the interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DeviceHealthCheckProvider to be instantiated and we would develop against this interface.

As I discussed in NVIDIA/k8s-dra-driver-gpu#689, I would expect this interface to look something like:

type DeviceHealthCheckProvider interface {
	Start(context.Context) error
	Stop()
	Health() <-chan *Device
}

(alternatively one could split the Health channel into Healthy() and Unhealthy()).

Comment on lines 278 to 282
// If health provider not available, wait for context cancellation
if plugin.healthProvider == nil {
	<-plugin.ctx.Done()
	return nil
}
Member:

Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?
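
For illustration, a no-op provider along those lines could be as small as the following sketch (assuming the Health channel is allowed to be nil, which is discussed further down):

type noopHealthProvider struct{}

func (n *noopHealthProvider) Start(context.Context) error { return nil }

func (n *noopHealthProvider) Stop() {}

// A nil channel never delivers a value, so consumers that select on it
// simply never observe a health event.
func (n *noopHealthProvider) Health() <-chan *Device { return nil }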

Collaborator Author:

thanks for the suggestion, adopted

Comment on lines 30 to 40
-// envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks
-// should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are
-// disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
-// this is in addition to the Application errors that are already ignored.
+// envDisableHealthChecks defines the environment variable that is
+// checked to determine whether healthchecks should be disabled. If
+// this envvar is set to "all" or contains the string "xids",
+// healthchecks are disabled entirely. If set, the envvar is treated
+// as a comma-separated list of Xids to ignore. Note that this is in
+// addition to the Application errors that are already ignored.
Member:

This is a nit: For complex refactorings, keeping changes to a minimum is important as we are able to reduce the noise and focus on the changes. In cases like these, we should update these comments as a separate commit.

Collaborator Author:

now in a [no-relnote] commit

	nvml:       nvml,
	config:     config,
	devices:    devices,
	healthChan: make(chan *Device, 64),
Member:

Question: Why 64?

Collaborator Author:

The size would be len(devices) × 4, but I thought 64 was a safe hard-coded number, as it covers all expected len(devices) values.
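
(A device-proportional buffer, for comparison, would be a one-liner along these lines; whether that is worth it over a constant is the question being weighed here.)

	healthChan: make(chan *Device, 4*len(devices)), // e.g. a few in-flight events per device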

Comment on lines 119 to 124
if p.started {
	p.mu.Unlock()
	return fmt.Errorf("health provider already started")
}
p.started = true
p.mu.Unlock()
Member:

Any reason to not defer p.mu.Unlock() instead?

Collaborator Author:

defer would be simpler but slower. Using defer would hold the mutex during NVML initialization, event set creation, and device registration, blocking other operations (like Stop()).

Member:

I don't agree that "slowness" is something we should optimize for. Start() and Stop() should not be running concurrently. As implemented, because we release the lock before Start() has completed (and similarly for Stop()), the remaining code may end up overlapping, which is not what we want.

Also note that we set started to true before the health monitor is actually ready, and this is not reset in the event of an error.
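
A minimal sketch of the serialized pattern being asked for (assumed names; initialize() is a hypothetical helper standing in for the NVML and event-set setup): the mutex is held for the whole call and started only flips once setup has actually succeeded.

func (p *nvmlHealthProvider) Start(ctx context.Context) error {
	p.Lock()
	defer p.Unlock()
	if p.started {
		return nil // already running; per the later discussion, not an error
	}
	if err := p.initialize(ctx); err != nil { // hypothetical setup helper
		return err // started remains false on failure
	}
	p.started = true
	return nil
}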

Collaborator Author:

suggestion adopted

wg sync.WaitGroup

// State guards
mu sync.Mutex
Member:

We could use an IsA relationship to simplify taking and releasing the lock:

Suggested change:
-	mu sync.Mutex
+	sync.Mutex
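
With the embedded mutex the provider itself satisfies sync.Locker, so call sites shorten accordingly (sketch; other fields elided):

type nvmlHealthProvider struct {
	sync.Mutex
	// ... other fields ...
}

// callers then use p.Lock() / p.Unlock() instead of p.mu.Lock() / p.mu.Unlock()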

Collaborator Author:

thanks for the suggestion, adopted

	ret := p.nvml.Init()
	if ret != nvml.SUCCESS {
-		if *r.config.Flags.FailOnInitError {
+		if *p.config.Flags.FailOnInitError {
Member:

nit: Let's not rename r to p in a single commit. (see comment on managing diffs).

Collaborator Author:

thanks for the suggestion, adopted; now an independent commit

Comment on lines 127 to 131
p.xidsDisabled = getDisabledHealthCheckXids()
if p.xidsDisabled.IsAllDisabled() {
	klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS")
	return nil
}
Member:

This should happen at construction and not as Start is called. If all healthChecks are disabled, we should return a no-op HealthProvider.

Collaborator Author:

thanks for the suggestion, adopted

Member:

I see that you have pulled this up to the func (r *nvmlResourceManager) HealthProvider() HealthProvider { implementation. This means that we have to construct the list of disabled XIDs twice. Why not handle this (and other static config) in the NewNVMLHealthProvider constructor instead?

Collaborator Author:

moved to NewNVMLHealthProvider now

klog.Warningf("NVML init failed: %v; health checks disabled", ret)
return nil
}
defer func() {
Member:

Could you explain the move away from a deferred shutdown?

Collaborator Author:

All error paths after Init() now have their own individual cleanup logic.

Member:

I think this is more error-prone than it needs to be. Note that we may want to distinguish between acceptable errors and unacceptable ones. As was the case for the related k8s-dra-driver PR, we may want to call Init() in the constructor and handle these errors (with a deferred Shutdown()) separately from errors that are triggered during Start().

For example, what about updating the constructor to:

// NewNVMLHealthProvider creates a new health provider for NVML devices.
// Does not start monitoring - caller must call Start().
func newNVMLHealthProvider(nvmllib nvml.Interface, config *spec.Config, devices Devices) (HealthProvider, error) {
	xids := getDisabledHealthCheckXids()
	if xids.IsAllDisabled() {
		return &noopHealthProvider{}, nil
	}

	ret := nvmllib.Init()
	if ret != nvml.SUCCESS {
		if *config.Flags.FailOnInitError {
			return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
		}
		klog.Warningf("NVML init failed: %v; health checks disabled", ret)
		return &noopHealthProvider{}, nil
	}
	defer func() {
		ret := nvmllib.Shutdown()
		if ret != nvml.SUCCESS {
			klog.Infof("Error shutting down NVML: %v", ret)
		}
	}()

	klog.Infof("Ignoring the following XIDs for health checks: %v", xids)

	p := &nvmlHealthProvider{
		nvml:         nvmllib,
		config:       config,
		devices:      devices,
		unhealthy:    make(chan *Device, 64),
		xidsDisabled: xids,
	}
	return p, nil
}

Note that we return a noopHealthProvider if we are tolerant of Init errors at this point. We would then update Start() to look something like:

// Start initializes NVML, registers event handlers, and starts the
// monitoring goroutine. Blocks until initialization completes.
func (r *nvmlHealthProvider) Start(ctx context.Context) (rerr error) {
	r.Lock()
	defer r.Unlock()
	if r.started {
		// TODO: Is this an error condition? Could we just return?
		return fmt.Errorf("health provider already started")
	}

	// Initialize NVML
	ret := r.nvml.Init()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	defer func() {
		if rerr != nil {
			_ = r.nvml.Shutdown()
		}
	}()

	// Create event set
	eventSet, ret := r.nvml.EventSetCreate()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("failed to create event set: %v", ret)
	}
	defer func() {
		if rerr != nil {
			_ = eventSet.Free()
		}
	}()

	// Register devices
	if err := r.registerDevices(eventSet); err != nil {
		return fmt.Errorf("failed to register devices: %w", err)
	}

	klog.Infof("Health monitoring started for %d devices", len(r.devices))

	// Create child context
	r.ctx, r.cancel = context.WithCancel(ctx)

	// Start monitoring goroutine
	r.wg.Add(1)
	go r.runEventMonitor(eventSet)

	r.started = true

	return nil
}

Here we take action immediately in the event of failure so that we don't have to rely on cleanup() eventually being called for these resources.

Collaborator Author:

Thanks for the detailed suggestion, adopted

}
return fmt.Errorf("failed to create event set: %v", ret)
}
defer func() {
Member:

is there a reason that we don't use the deferred cleanup here?

Collaborator Author:

All error paths after Init() now have their own individual cleanup logic.

Member:

Please see my comment above.

Collaborator Author:

ack

}

func (n *noopHealthProvider) Start(context.Context) error {
	n.healthChan = make(chan *Device)
Member:

Why not just do this at construction?

Member:

Also, do we need to actually create a channel? Can we not leave it nil?
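
Receiving from a nil channel blocks forever, so leaving it nil is safe for consumers that select on it alongside a cancellation signal; roughly (assumed consumer names):

	select {
	case d := <-plugin.healthProvider.Health(): // nil for the no-op provider; never selected
		// handle the device health change
		_ = d
	case <-plugin.ctx.Done():
		return nil
	}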

Collaborator Author:

done

r.Lock()
if r.started {
	r.Unlock()
	return fmt.Errorf("health provider already started")
Member:

Why is this an error? Assuming we only set started once the HealthProvider / HealthMonitor has been successfully started, is there any harm in calling it again?

Collaborator Author:

Agreed, it's now a return nil

Comment on lines 124 to 125
// Get XID filter configuration
r.xidsDisabled = getDisabledHealthCheckXids()
Member:

This should be moved to construction.

Collaborator Author:

moved

-	e, ret := eventSet.Wait(5000)
+	// Wait for NVML event (5 second timeout)
+	event, ret := r.eventSet.Wait(5000)

Member:

Suggested change

Collaborator Author:

Done
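
For context on the Wait(5000) call under discussion: a timeout from EventSet.Wait is the normal idle case rather than an error, so the monitoring loop treats it as a cue to re-check for cancellation. A rough sketch (assumed names; not the exact code in this PR):

func (r *nvmlHealthProvider) runEventMonitor(eventSet nvml.EventSet) {
	defer r.wg.Done()
	for {
		select {
		case <-r.ctx.Done():
			return
		default:
		}

		// Wait for an NVML event (5 second timeout).
		event, ret := eventSet.Wait(5000)
		if ret == nvml.ERROR_TIMEOUT {
			continue // idle; loop back and re-check for cancellation
		}
		if ret != nvml.SUCCESS {
			klog.Warningf("Error waiting for NVML event: %v", ret)
			continue
		}
		// event.EventData carries the XID; filter it against xidsDisabled and
		// push the affected device onto the health channel here.
		_ = event
	}
}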

continue
}
// Create child context
r.ctx, r.cancel = context.WithCancel(ctx)
Member:

Could you comment on whether we need to add WithCancel to the context if it already has this done for the plugin?

Collaborator Author:

we now use the plugin context
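
In other words, roughly this ownership (a sketch with assumed names, not the exact diff): the plugin creates and cancels the context, and the provider simply derives its lifetime from whatever it receives in Start.

// plugin start path
plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
if err := plugin.healthProvider.Start(plugin.ctx); err != nil {
	return fmt.Errorf("failed to start health provider: %w", err)
}

// plugin stop path
plugin.cancel()              // cancels the shared context
plugin.healthProvider.Stop() // waits for the monitoring goroutine to exit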

}()
// Initialize and start health provider
plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
plugin.healthProvider = plugin.rm.HealthProvider()
Member:

This should be moved to the constructor.

Collaborator Author:

moved

	socket: getPluginSocketPath(resourceManager.Resource()),
-	// These will be reinitialized every
-	// time the plugin server is restarted.
+	// server and healthProvider will be reinitialized every time
elezar (Member) commented Nov 26, 2025:

Why can't we instantiate the healthProvider here? Why is it required to reinitialize it on every start?

Would the following be valid?

Suggested change:
-	// server and healthProvider will be reinitialized every time
+	healthProvider: resourceManager.HealthProvider(),
+	// server and healthProvider will be reinitialized every time

Collaborator Author:

Added

}
}()
// Initialize and start health provider
plugin.ctx, plugin.cancel = context.WithCancel(context.Background())
Member:

At what point would we rather pass in a ctx and handle the Cancel() call externally?

Collaborator Author:

we now use the plugin context

Extract device health checking logic into a dedicated HealthProvider
interface with proper lifecycle management using WaitGroups and context.

- Add HealthProvider interface (Start/Stop/Health methods)
- Implement nvmlHealthProvider with WaitGroup coordination
- Update ResourceManager to return HealthProvider instead of CheckHealth
- Update device plugin to use HealthProvider
- Add no-op implementation for Tegra devices

This refactoring improves code modularity and testability without
changing existing behavior. Prepares foundation for future device
recovery features.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>