@umfranci
Collaborator

  • The verify_gpu_adapter_count test validates the GPU count by comparing the output of the lsvmbus, lspci, and nvidia-smi commands. However, it relies on a hardcoded list of GPU models and their device IDs to identify GPUs in the lsvmbus output.
  • This hardcoded approach fails when testing new GPU models and requires a manual code update each time new GPU hardware is released. The result is testing delays, maintenance overhead, and a higher failure rate for the test.
  • The aim of this PR is to implement dynamic GPU detection that identifies new GPU models automatically, without manual intervention, while remaining backward compatible with the existing GPU detection logic.
  • Suggested Fix:
    • Primary detection: Continue using the existing hardcoded GPU list for known models
    • Fallback mechanism: When no matches are found in the hardcoded list:
      • Group VMBus devices by their last segment (device ID suffix)
      • Identify GPU device groups where all entries are marked as "PCI Express pass-through"
      • Validate the count matches nvidia-smi output for accuracy
    • Direct counting: Add a new function that reads the GPU count directly from the nvidia-smi command output, removing the dependency on maintaining a hardcoded GPU model list
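
The fallback and direct-counting steps above could look roughly like the sketch below. It is illustrative only and is not the code in this PR: `VmbusDevice`, `last_segment`, `count_gpus_by_segment`, and `parse_nvidia_smi_count` are hypothetical names, and the sketch assumes the GPU count comes from `nvidia-smi -L`, which prints one `GPU <index>: ...` line per GPU.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

PASS_THROUGH = "PCI Express pass-through"


@dataclass
class VmbusDevice:
    # Hypothetical stand-in for the device objects parsed from lsvmbus output.
    name: str
    device_id: str  # e.g. "{56475055-0002-0000-3130-303237344131}"


def last_segment(device_id: str) -> str:
    # Take the text after the final "-" of the device ID, without braces.
    return device_id.strip("{}").rsplit("-", 1)[-1]


def count_gpus_by_segment(devices: List[VmbusDevice], expected_count: int) -> int:
    # Group devices by the last segment of their device ID.
    groups: Dict[str, List[VmbusDevice]] = defaultdict(list)
    for device in devices:
        groups[last_segment(device.device_id)].append(device)

    # Accept a group only if every member is a pass-through device and the
    # group size matches the count reported by nvidia-smi.
    for members in groups.values():
        all_pass_through = all(PASS_THROUGH in m.name for m in members)
        if all_pass_through and len(members) == expected_count:
            return len(members)
    return 0


def parse_nvidia_smi_count(output: str) -> int:
    # Assumes `nvidia-smi -L` style output: one "GPU <index>: ..." line per GPU.
    return sum(1 for line in output.splitlines() if line.startswith("GPU "))
```

In this sketch the nvidia-smi count is used as the expected value, so a pass-through group of a different size is rejected rather than reported as the GPU count.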

@umfranci umfranci marked this pull request as ready for review October 28, 2025 21:59
@umfranci umfranci requested a review from LiliDeng as a code owner October 28, 2025 21:59
@umfranci
Collaborator Author

@squirrelsc @LiliDeng any further inputs/comments on this please?


return 0

def _get_gpu_count_by_device_id_segment(self, vmbus_devices: List[Any]) -> int:
Member

@squirrelsc squirrelsc Oct 30, 2025

It looks like this method doesn't add much beyond the raw information. All VMBus devices should already be listed by earlier commands in the LISA log for troubleshooting. Unless the list is long, say over 50 entries, there is no need to check and print it again.

Collaborator Author

True, the initial intent was to utilize this segmentation in order to try and reduce the failure rate of the test case!

Member

How about removing this method?

def _has_sequential_pattern(self, devices: List[Any]) -> bool:
"""
Check if devices have sequential numbering in their IDs.
GPUs typically have patterns like 0101, 0102, 0103, 0104.
Member

Where did you find this info? Could you add a link above? If there are other types of devices, maybe they’re listed in a similar way too.

Collaborator Author

@umfranci umfranci Oct 30, 2025

I could not find an official doc for it, but this is a trend commonly observed on multi-GPU SKUs such as GB200 and MI300. Example:

Device_ID = {56475055-0002-0000-3130-303237344131}
Device_ID = {56475055-0003-0000-3130-303237344131}
Device_ID = {56475055-0004-0000-3130-303237344131}
Device_ID = {56475055-0005-0000-3130-303237344131}
Device_ID = {56475055-0006-0000-3130-303237344131}
Device_ID = {56475055-0007-0000-3130-303237344131}
Device_ID = {56475055-0008-0000-3130-303237344131}
Device_ID = {56475055-0009-0000-3130-303237344131}

Device_ID = {00000003-0101-0000-3135-423331303142}
Device_ID = {00000203-0102-0000-3135-423331303142}
Device_ID = {00001003-0103-0001-3135-423331303142}
Device_ID = {00001203-0104-0001-3135-423331303142}
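
As a rough illustration of the observation above, the sketch below checks whether the second segment of each device ID forms a consecutive run. The helper names `second_segment_values` and `is_sequential` are hypothetical, reading the segment as hexadecimal is an assumption, and the pattern itself is an undocumented observation rather than a guaranteed property of GPU devices.

```python
from typing import List


def second_segment_values(device_ids: List[str]) -> List[int]:
    # Read the second dash-separated segment of each ID as a hex number
    # (assumption: the segment is hexadecimal).
    return [int(d.strip("{}").split("-")[1], 16) for d in device_ids]


def is_sequential(device_ids: List[str]) -> bool:
    # True when there are at least two IDs and the sorted second segments
    # increase by exactly one from each ID to the next.
    values = sorted(second_segment_values(device_ids))
    return len(values) > 1 and all(b - a == 1 for a, b in zip(values, values[1:]))


ids = [
    "{00000003-0101-0000-3135-423331303142}",
    "{00000203-0102-0000-3135-423331303142}",
    "{00001003-0103-0001-3135-423331303142}",
    "{00001203-0104-0001-3135-423331303142}",
]
print(is_sequential(ids))
```

Any other device type whose IDs happen to be numbered consecutively would pass the same check, so a match on its own does not identify a GPU.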

Member

It's not an official pattern, and it may be confused with other device types in the future. Please remove them.
