@nv-oviya nv-oviya commented Dec 2, 2025

Overview

Adds a 56-line XID 79 test that demonstrates the fixture framework from PR #4690. A test that previously required 600+ lines now takes 2-3 lines. Also includes a README for the examples, with a usage guide and NVSentinel troubleshooting notes.


Details

Files Changed:

  • test_xid79_minimal.py - Three test variants using fixtures
  • README.md - Quick start guide, fixtures reference, troubleshooting

Test variants:

  • test_xid79_full_automation - Full NVSentinel workflow (cordon → drain → remediate → uncordon)
  • test_xid79_cordon_drain_only - Cordon + drain only (most common, ~5-10 min)
  • test_xid79_all_gpus - Parametrized test across GPUs 0-3
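
As an illustration of the parametrized variant, a test of this shape could look roughly like the sketch below. It is not the code from this PR: the fixture name `inject_xid79`, the helper `build_fault_request`, and the returned dictionary are hypothetical stand-ins, and the fixture only records the request instead of touching a cluster.

```python
import pytest

XID_79 = 79  # XID 79: "GPU has fallen off the bus"


def build_fault_request(gpu_id: int) -> dict:
    """Hypothetical helper: describe the fault a real fixture would inject."""
    return {"xid": XID_79, "gpu_id": gpu_id}


@pytest.fixture
def inject_xid79():
    """Stand-in for a fault-injection fixture (name and behavior assumed)."""
    injected = []

    def _inject(gpu_id: int) -> dict:
        request = build_fault_request(gpu_id)
        injected.append(request)
        return request

    yield _inject
    # A real fixture would remove the injected fault here; this one only
    # forgets what it recorded.
    injected.clear()


@pytest.mark.parametrize("gpu_id", [0, 1, 2, 3])
def test_xid79_all_gpus(inject_xid79, gpu_id):
    # The test body stays short because setup and teardown live in the fixture.
    fault = inject_xid79(gpu_id)
    assert fault == {"xid": XID_79, "gpu_id": gpu_id}
```

With `pytest.mark.parametrize`, pytest collects one test case per GPU index, so a failure on a single GPU is reported separately from the others.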

Test flow (automated by fixtures):

  • Phase 0: Natural pod distribution + baseline
  • Phase 1: Persistent fault injection via hostPath (pods on faulty node crash-loop)
  • Phase 2: NVSentinel cordons and drains node
  • Phase 3: Pods reschedule to healthy nodes, inference recovers
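
The four phases can be pictured with a toy in-memory model. The sketch below is an assumption-laden illustration only: `Node`, `cordon`, `drain`, and `run_flow` are hypothetical names, nothing here calls Kubernetes or NVSentinel, and the "least-loaded node" placement rule is a simplification rather than the real scheduler's behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    schedulable: bool = True
    pods: list = field(default_factory=list)


def cordon(node: Node) -> None:
    """Mark the node unschedulable so no new pods land on it."""
    node.schedulable = False


def drain(node: Node, cluster: list) -> None:
    """Evict every pod and place it on the least-loaded schedulable node."""
    evicted, node.pods = node.pods, []
    for pod in evicted:
        targets = [n for n in cluster if n.schedulable]
        if not targets:
            raise RuntimeError(f"no schedulable node left for {pod}")
        min(targets, key=lambda n: len(n.pods)).pods.append(pod)


def run_flow() -> list:
    # Phase 0: baseline pod distribution.
    cluster = [
        Node("node-a", pods=["worker-0", "worker-1"]),
        Node("node-b", pods=["worker-2"]),
    ]
    faulty = cluster[0]
    # Phase 1: the fault persists on node-a, so its pods keep crashing.
    crash_looping = set(faulty.pods)
    # Phase 2: the node is cordoned, then drained.
    cordon(faulty)
    drain(faulty, cluster)
    # Phase 3: every previously crash-looping pod now runs on a healthy node.
    healthy_pods = {p for n in cluster if n.schedulable for p in n.pods}
    assert crash_looping <= healthy_pods
    return cluster
```

Calling `run_flow()` leaves `node-a` empty and unschedulable, with all three pods on `node-b`.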

README sections:

  • Quick Start (single command)
  • Available fixtures reference
  • What gets validated
  • Troubleshooting the NVSentinel AllowCompletion mode issue (which causes tests to hang)

Where should the reviewer start?


Related Issues

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>

copy-pr-bot bot commented Dec 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
