Fixing GPU Adapter Count test to be more dynamic and fail resistant #4038
umfranci
commented
Oct 10, 2025
- The verify_gpu_adapter_count test validates GPU counts by comparing the output of the lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
- This hardcoded approach fails on new GPU models and requires a manual code update each time new GPU hardware is released. That causes testing delays and maintenance overhead, and raises the failure rate of the test.
- The aim here is to implement dynamic GPU detection that identifies new GPU models without manual intervention, while maintaining backward compatibility with the existing GPU detection logic.
- Suggested Fix:
  - Primary detection: Continue using the existing hardcoded GPU list for known models.
  - Fallback mechanism: When no matches are found in the hardcoded list:
    - Group VMBus devices by the last segment of their device ID (the device ID suffix).
    - Identify GPU device groups where all entries are marked as "PCI Express pass-through".
    - Validate that the count matches the nvidia-smi output for accuracy.
  - Direct counting: Added a new function that gets the GPU count directly from the nvidia-smi command output, removing the dependency on maintaining a hardcoded GPU model list.
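For readers following the thread, the fallback and direct-counting steps above could be sketched roughly as follows. This is not the code in this PR: `VmbusDevice`, `last_segment`, `count_gpus_by_segment`, and `count_gpus_from_nvidia_smi_list` are hypothetical names, the device records are a simplified stand-in for parsed lsvmbus entries, and the counting helper assumes it is given the text printed by `nvidia-smi -L` (one `GPU N: ...` line per GPU).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

PASSTHROUGH_NAME = "PCI Express pass-through"


@dataclass
class VmbusDevice:
    # Hypothetical, simplified stand-in for one parsed lsvmbus entry.
    name: str
    device_id: str


def last_segment(device_id: str) -> str:
    # "{00000003-0101-0000-3135-423331303142}" -> "423331303142"
    return device_id.strip("{}").split("-")[-1]


def count_gpus_by_segment(devices: List[VmbusDevice], expected_count: int) -> int:
    """Return the size of a pass-through-only group matching expected_count, else 0."""
    groups: Dict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        groups[last_segment(device.device_id)].append(device)

    for members in groups.values():
        all_passthrough = all(m.name == PASSTHROUGH_NAME for m in members)
        # Only trust a group whose size agrees with the nvidia-smi count.
        if all_passthrough and len(members) == expected_count:
            return len(members)
    return 0


def count_gpus_from_nvidia_smi_list(output: str) -> int:
    # Assumes the text of `nvidia-smi -L`; indented MIG lines are not counted.
    return sum(1 for line in output.splitlines() if line.strip().startswith("GPU "))
```

Cross-checking the group size against the nvidia-smi count is what keeps the grouping from misreporting an unrelated set of pass-through devices (for example NVMe or NIC pass-through) as GPUs.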
@squirrelsc @LiliDeng any further inputs/comments on this please?
lisa/features/gpu.py
Outdated
        return 0

    def _get_gpu_count_by_device_id_segment(self, vmbus_devices: List[Any]) -> int:
It looks like this method doesn't add much beyond the raw information. All VMBus devices should already be listed by previous commands in the LISA log for troubleshooting. Unless the list is long, say over 50 entries, there is no need to check and print it again.
True, the initial intent was to use this segmentation to try to reduce the failure rate of the test case.
How about removing this method?
    def _has_sequential_pattern(self, devices: List[Any]) -> bool:
        """
        Check if devices have sequential numbering in their IDs.
        GPUs typically have patterns like 0101, 0102, 0103, 0104.
Where did you find this info? Could you add a link above? If there are other types of devices, maybe they’re listed in a similar way too.
I could not find official documentation for it, but this is a trend commonly observed on multi-GPU SKUs such as GB200 and MI300. Example:
Device_ID = {56475055-0002-0000-3130-303237344131}
Device_ID = {56475055-0003-0000-3130-303237344131}
Device_ID = {56475055-0004-0000-3130-303237344131}
Device_ID = {56475055-0005-0000-3130-303237344131}
Device_ID = {56475055-0006-0000-3130-303237344131}
Device_ID = {56475055-0007-0000-3130-303237344131}
Device_ID = {56475055-0008-0000-3130-303237344131}
Device_ID = {56475055-0009-0000-3130-303237344131}
Device_ID = {00000003-0101-0000-3135-423331303142}
Device_ID = {00000203-0102-0000-3135-423331303142}
Device_ID = {00001003-0103-0001-3135-423331303142}
Device_ID = {00001203-0104-0001-3135-423331303142}
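To make the heuristic under discussion concrete, here is a minimal sketch of what a sequential-pattern check over these IDs might look like. It is not the implementation in this PR: `second_field` and `has_sequential_pattern` are hypothetical names, the choice of the second GUID field and its hex parsing are assumptions drawn only from the example IDs above, and the pattern itself is an observation rather than a documented guarantee.

```python
from typing import List


def second_field(device_id: str) -> int:
    # "{00000003-0101-0000-3135-423331303142}" -> 0x0101
    return int(device_id.strip("{}").split("-")[1], 16)


def has_sequential_pattern(device_ids: List[str]) -> bool:
    """True if the second GUID field increases by exactly 1 across the IDs."""
    if len(device_ids) < 2:
        return False
    values = sorted(second_field(d) for d in device_ids)
    return all(b - a == 1 for a, b in zip(values, values[1:]))
```

Nothing in this check is specific to GPUs, so any other device type that happens to be numbered consecutively would also pass it.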
It's not an official pattern, and it may be confused with other device types in the future. Please remove them.