test(feat-injection): add 200x minimized XID 79 PyTest harnessing fixtures #4694

nv-oviya · 2025-12-02T04:28:29Z

Overview

Adds a 56-line XID 79 test that demonstrates the fixture framework from PR #4690. What previously required 600+ lines of boilerplate now takes 2-3 lines per test. Also includes an examples README with a usage guide and NVSentinel troubleshooting notes.

Details

Files Changed:

test_xid79_minimal.py - Three test variants using fixtures
README.md - Quick start guide, fixtures reference, troubleshooting

Test variants:

test_xid79_full_automation - Full NVSentinel workflow (cordon → drain → remediate → uncordon)
test_xid79_cordon_drain_only - Cordon + drain only (most common, ~5-10 min)
test_xid79_all_gpus - Parametrized test across GPUs 0-3
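
To illustrate the shape of a fixture-driven test like the parametrized variant above, here is a minimal sketch. The fixture name `xid79_harness`, the `FakeXid79Harness` class, and its `inject` / `node_cordoned` methods are hypothetical stand-ins chosen for this example; they are not the names used in PR #4690, and the stub touches no cluster.

```python
import pytest


class FakeXid79Harness:
    """Hypothetical stand-in for the fixture object; not the PR's API."""

    def __init__(self):
        self.injected = []

    def inject(self, gpu_id):
        # A real harness would trigger XID 79 on the target GPU here.
        self.injected.append(gpu_id)

    def node_cordoned(self):
        # Stubbed: pretend the node is cordoned once any fault was injected.
        return bool(self.injected)


@pytest.fixture
def xid79_harness():
    # Hypothetical fixture name; setup runs before the yield, cleanup after.
    harness = FakeXid79Harness()
    yield harness
    harness.injected.clear()


@pytest.mark.parametrize("gpu_id", [0, 1, 2, 3])
def test_xid79_all_gpus(xid79_harness, gpu_id):
    # The test body stays at two lines; the fixture owns everything else.
    xid79_harness.inject(gpu_id)
    assert xid79_harness.node_cordoned()
```

With this layout pytest generates one test case per GPU ID, and each case gets a fresh harness from the fixture.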

Test flow (automated by fixtures):

Phase 0: Natural pod distribution + baseline
Phase 1: Persistent fault injection via hostPath (pods on faulty node crash-loop)
Phase 2: NVSentinel cordons and drains node
Phase 3: Pods reschedule to healthy nodes, inference recovers
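
The phase ordering above can be sketched with a context manager, which has the same setup/teardown shape as a pytest `yield` fixture. This is only an illustration under assumptions: `persistent_fault`, `run_phases`, and the event strings are hypothetical names, and nothing here touches a cluster or the real hostPath mechanism.

```python
from contextlib import contextmanager


@contextmanager
def persistent_fault(events, node):
    # Phase 1 setup: a real harness would place the fault marker via hostPath.
    events.append(f"inject:{node}")
    try:
        yield
    finally:
        # Cleanup runs even if a later phase raises.
        events.append(f"clear:{node}")


def run_phases(node="node-a"):
    # Phase 0 happens before any fault is injected.
    events = ["phase0:baseline"]
    with persistent_fault(events, node):
        events.append("phase1:pods-crash-loop")
        events.append("phase2:cordon-and-drain")
        events.append("phase3:reschedule-and-recover")
    return events
```

Keeping the cleanup in a `finally` block means the fault is cleared even when a phase fails partway through.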

README sections:

Quick Start (single command)
Available fixtures reference
What gets validated
Troubleshooting NVSentinel AllowCompletion mode issue (causes test hangs)
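
Separately from whatever fix the README documents for the hang, a test that waits on a drain can bound that wait so a stall surfaces as a failure rather than an indefinite hang. The sketch below is a generic polling helper using only the standard library; `wait_until`, `assert_drained`, and the `is_drained` callable are hypothetical and not part of the fixtures or of NVSentinel.

```python
import time


def wait_until(predicate, timeout_s, interval_s=0.01):
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False


def assert_drained(is_drained, timeout_s):
    # is_drained is a hypothetical callable that reports drain completion.
    if not wait_until(is_drained, timeout_s):
        raise TimeoutError(f"node not drained within {timeout_s}s")
```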

Where should the reviewer start?

Read the test file - Three 2-3 line tests showing fixture usage
Check the README - Especially the troubleshooting section (documents the NVSentinel drain policy issue we discovered)
Compare with test(fault-injection): Add XID 79 NVSentinel E2E test #4046

Related Issues

Depends on PR feat(fault-injection): Enable runtime CUDA fault injection toggling without pod restarts #4679 (hostPath), PR refactor(feat-injection): create PyTest fixtures to replace boilerplate in E2E HW FT tests #4690 (fixtures), and PR feat(fault-injection): Add latency percentile metrics and per-phase statistics tracking #4692 (metrics).
Part of fault tolerance initiative.

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>

copy-pr-bot · 2025-12-02T04:28:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

add XID 79 persistent GPU failure E2E test with NVSentinel integration

a8d8ede

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>

pull-request-size bot added the size/L label Dec 2, 2025

github-actions bot added the test label Dec 2, 2025
