@ArangoGutierrez ArangoGutierrez commented Nov 18, 2025

Modifies Merge() to log and skip labelers that fail instead of aborting the entire labeling pipeline. When a device fails (XID error, GPU lost, etc.), GFD (GPU Feature Discovery) continues with the healthy devices and generates partial labels.

@ArangoGutierrez ArangoGutierrez self-assigned this Nov 18, 2025
@ArangoGutierrez ArangoGutierrez added the feature label (issue/PR that proposes a new feature or functionality) Nov 18, 2025
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review November 18, 2025 18:45
@ArangoGutierrez ArangoGutierrez changed the title from "Enhance device health check with compute capability probe" to "[GFD] Prevent GFD from crashing when device is unhealthy" Nov 18, 2025

@elezar elezar left a comment

How is what you're doing here different to just ignoring errors during labelling?

@ArangoGutierrez

> How is what you're doing here different to just ignoring errors during labelling?

PTAL

Make Merge() resilient to individual labeler failures by logging errors
as warnings and continuing with remaining labelers. This prevents GFD
from crashing when devices go unhealthy (e.g., XID errors) and allows
partial label generation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez marked this pull request as draft November 21, 2025 11:45
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review November 21, 2025 15:29