-
Notifications
You must be signed in to change notification settings - Fork 759
Refactor checkHealth function #1508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
541c6cb to
44d450f
Compare
internal/plugin/server.go
Outdated
| }() | ||
|
|
||
| // Start recovery worker to detect when unhealthy devices become healthy | ||
| go plugin.runRecoveryWorker() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we split the refactoring (that doesn't add any new behaviour) into a different PR from the one that adds devices becoming healthy again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like a good idea, and even more based on your other comment #1508 (review)
I wanted a re-factor, but that interface is a diff conversation. Going to work on splitting this PR
elezar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the context of the k8s-dra-driver-gpu we discused the Interface that we would expect a DeviceHealthCheckProvider to have. Where is that considered here? From the perspective of the device plugin (or its associated ResourceManager), I would expect a DevideHealthCheckProvider to be instantiated and we would develop against this intervace.
As I discussed in NVIDIA/k8s-dra-driver-gpu#689 I would expect this interface to look something like:
type DeviceHealthCheckProvider interface {
Start(context.Context) error
Stop()
Health() <-channel Device
(alternatively one could split the Health channel into Healthy() and Unhealthy()).
internal/plugin/server.go
Outdated
| // If health provider not available, wait for context cancellation | ||
| if plugin.healthProvider == nil { | ||
| <-plugin.ctx.Done() | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under which conditions is the healthProvider nil? Could we not rather ALWAYS use at least a "no-op" healthProvider to ensure that we don't need to special case this here or at any point where we call Start or Stop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted
internal/rm/health.go
Outdated
| // envDisableHealthChecks defines the environment variable that is checked to determine whether healthchecks | ||
| // should be disabled. If this envvar is set to "all" or contains the string "xids", healthchecks are | ||
| // disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that | ||
| // this is in addition to the Application errors that are already ignored. | ||
| // envDisableHealthChecks defines the environment variable that is | ||
| // checked to determine whether healthchecks should be disabled. If | ||
| // this envvar is set to "all" or contains the string "xids", | ||
| // healthchecks are disabled entirely. If set, the envvar is treated | ||
| // as a comma-separated list of Xids to ignore. Note that this is in | ||
| // addition to the Application errors that are already ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nit: For complex refactorings, keeping changes to a minimum is important as we are able to reduce the noise and focus on the changes. In cases like these, we should update these comments as a separate commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now in a [no-relnote] commit
internal/rm/health.go
Outdated
| nvml: nvml, | ||
| config: config, | ||
| devices: devices, | ||
| healthChan: make(chan *Device, 64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why 64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size would be len(devices) × 4, but I thought 64 was a safe hard coded number as it covers all possible len(devices) sizes
internal/rm/health.go
Outdated
| if p.started { | ||
| p.mu.Unlock() | ||
| return fmt.Errorf("health provider already started") | ||
| } | ||
| p.started = true | ||
| p.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not defer p.mu.Unlock() instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer would be simpler but slower. Using defer would hold the mutex during {NVML initialization, Event set creation, Device registration}, blocking other operations (like Stop()).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree that "slowness" is something we should optimize for. Start() and Stop() should not be running concurrently. As implemented, because we release the lock before Start() has completed (and similarly for Stop()) the remaining code may end up overlapping which is not what we want.
Also note that we set started to true before the health monitor is actually ready and this is not reset in the envent of an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion adopted
internal/rm/health.go
Outdated
| wg sync.WaitGroup | ||
|
|
||
| // State guards | ||
| mu sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use an IsA relationship to simplify taking and releasing the lock:
| mu sync.Mutex | |
| sync.Mutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted
internal/rm/health.go
Outdated
| ret := p.nvml.Init() | ||
| if ret != nvml.SUCCESS { | ||
| if *r.config.Flags.FailOnInitError { | ||
| if *p.config.Flags.FailOnInitError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Let's not rename r to p in a single commit. (see comment on managing diffs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted , now an independent commit
internal/rm/health.go
Outdated
| p.xidsDisabled = getDisabledHealthCheckXids() | ||
| if p.xidsDisabled.IsAllDisabled() { | ||
| klog.Info("Health checks disabled via DP_DISABLE_HEALTHCHECKS") | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should happen at construction and not as Start is called. If all healthChecks are disabled, we should return a no-op HealthProvider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion, adopted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you have pulled this up to the func (r *nvmlResourceManager) HealthProvider() HealthProvider { implementation. This means that we have to construct the list of disabled XIDs twice. Why not handle this (and other static config) in the NewNVMLHealthProvider constructor instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to NewNVMLHealthProvider now
| klog.Warningf("NVML init failed: %v; health checks disabled", ret) | ||
| return nil | ||
| } | ||
| defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the move away from a deferred shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All error paths after Init() have now individual clean up logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more error prone than it needs to be. Note that we may want to distinguish between acceptable errors and non acceptable ones. As was the case for the related k8s-dra-driver PR, we may want to call Init() in the constructor and handle these errors (with deferred Shutdown() separately from errors that are triggered during Start.
For example, what about updating the constructor to:
// NewNVMLHealthProvider creates a new health provider for NVML devices.
// Does not start monitoring - caller must call Start().
func newNVMLHealthProvider(nvmllib nvml.Interface, config *spec.Config, devices Devices) (HealthProvider, error) {
xids := getDisabledHealthCheckXids()
if xids.IsAllDisabled() {
return &noopHealthProvider{}, nil
}
ret := nvmllib.Init()
if ret != nvml.SUCCESS {
if *config.Flags.FailOnInitError {
return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
}
klog.Warningf("NVML init failed: %v; health checks disabled", ret)
return &noopHealthProvider{}, nil
}
defer func() {
ret := nvmllib.Shutdown()
if ret != nvml.SUCCESS {
klog.Infof("Error shutting down NVML: %v", ret)
}
}()
klog.Infof("Ignoring the following XIDs for health checks: %v", xids)
p := &nvmlHealthProvider{
nvml: nvmllib,
config: config,
devices: devices,
unhealthy: make(chan *Device, 64),
xidsDisabled: xids,
}
return p, nil
}
Note that we return a noopHealthProvider if we are tolerant of Init errors at this point. However, we update Start() to look something like:
// Start initializes NVML, registers event handlers, and starts the
// monitoring goroutine. Blocks until initialization completes.
func (r *nvmlHealthProvider) Start(ctx context.Context) (rerr error) {
r.Lock()
defer r.Unlock()
if r.started {
// TODO: Is this an error condition? Could we just return?
return fmt.Errorf("health provider already started")
}
r.Unlock()
// Initialize NVML
ret := r.nvml.Init()
if ret != nvml.SUCCESS {
return fmt.Errorf("failed to initialize NVML: %v", ret)
}
defer func() {
if rerr != nil {
_ = r.nvml.Shutdown()
}
}()
// Create event set
eventSet, ret := r.nvml.EventSetCreate()
if ret != nvml.SUCCESS {
return fmt.Errorf("failed to create event set: %v", ret)
}
defer func() {
if rerr != nil {
_ = eventSet.Free()
}
}()
// Register devices
if err := r.registerDevices(eventSet); err != nil {
return fmt.Errorf("failed to register devices: %w", err)
}
klog.Infof("Health monitoring started for %d devices", len(r.devices))
// Create child context
r.ctx, r.cancel = context.WithCancel(ctx)
// Start monitoring goroutine
r.wg.Add(1)
go r.runEventMonitor(eventSet)
r.started = true
return nil
}
Where we take actions immediately in the event of failure so that we don't have to rely on cleanup() being called eventually for these resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed suggestion, adopted
| } | ||
| return fmt.Errorf("failed to create event set: %v", ret) | ||
| } | ||
| defer func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason that we don't use the deferred cleanup here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All error paths after Init() have now individual clean up logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack
internal/rm/tegra_manager.go
Outdated
| } | ||
|
|
||
| func (n *noopHealthProvider) Start(context.Context) error { | ||
| n.healthChan = make(chan *Device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just do this at construction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we need to actually create a channel? Can we no leave it nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
91f4a6c to
33636eb
Compare
| r.Lock() | ||
| if r.started { | ||
| r.Unlock() | ||
| return fmt.Errorf("health provider already started") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this an error? Assuming we only set started once the HealthProvider / HealthMonitor has been successfully started, is there any harm in calling it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, now is a return nil
internal/rm/health.go
Outdated
| // Get XID filter configuration | ||
| r.xidsDisabled = getDisabledHealthCheckXids() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
internal/rm/health.go
Outdated
| e, ret := eventSet.Wait(5000) | ||
| // Wait for NVML event (5 second timeout) | ||
| event, ret := r.eventSet.Wait(5000) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
internal/rm/health.go
Outdated
| continue | ||
| } | ||
| // Create child context | ||
| r.ctx, r.cancel = context.WithCancel(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you comment on whether we need to add WithCancel to the context if it already has this done for the plugin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we now use the plugin context
internal/plugin/server.go
Outdated
| }() | ||
| // Initialize and start health provider | ||
| plugin.ctx, plugin.cancel = context.WithCancel(context.Background()) | ||
| plugin.healthProvider = plugin.rm.HealthProvider() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved
internal/plugin/server.go
Outdated
| socket: getPluginSocketPath(resourceManager.Resource()), | ||
| // These will be reinitialized every | ||
| // time the plugin server is restarted. | ||
| // server and healthProvider will be reinitialized every time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we instantiate the healthProvider here? Why is it required to reinitilize it on every start?
Would the following be valid?
| // server and healthProvider will be reinitialized every time | |
| healthProvider: resourceManager.HealthProvider(), | |
| // server and healthProvider will be reinitialized every time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
internal/plugin/server.go
Outdated
| } | ||
| }() | ||
| // Initialize and start health provider | ||
| plugin.ctx, plugin.cancel = context.WithCancel(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At which point do we rather pass in a ctx and handle the Cancel() call externally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we now use the plugin context
Extract device health checking logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context. - Add HealthProvider interface (Start/Stop/Health methods) - Implement nvmlHealthProvider with WaitGroup coordination - Update ResourceManager to return HealthProvider instead of CheckHealth - Update device plugin to use HealthProvider - Add no-op implementation for Tegra devices This refactoring improves code modularity and testability without changing existing behavior. Prepares foundation for future device recovery features. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
This patch refactors the device health check system by extracting the logic into a dedicated HealthProvider interface with proper lifecycle management using WaitGroups and context.
No behavior changes - this is a pure refactoring to improve code modularity and testability.