
Conversation

@nv-oviya (Contributor) commented Dec 2, 2025

Overview:

Implements runtime fault injection control via filesystem toggles and hostPath volumes, eliminating the need for pod restarts when enabling/disabling CUDA faults (except for initial library deployment).

Previously, enabling/disabling CUDA fault injection required modifying environment variables and restarting pods, which is unrealistic for testing fault-tolerance recovery scenarios. This PR introduces a 3-tier toggling system that allows instant fault activation/deactivation in running pods.


Details:

  1. cuda_intercept.c - Runtime toggle detection
    • Modifies get_fault_config() to check the fault injection status on every CUDA call rather than once at initialization
    • Implements a 3-tier fallback: /host-fault/cuda_fault_enabled (persistent) → /tmp/cuda_fault_enabled (ephemeral) → CUDA_FAULT_INJECTION_ENABLED env var (a lookup-order sketch follows this list)
    • The hostPath check ensures faults persist across pod crashes/restarts on the same node, accurately simulating a persistent hardware failure
    • When a pod reschedules to a different node → no fault marker file → automatic recovery
  2. inject_into_pods.py - Persistent volume infrastructure
    • Adds a node-fault-marker hostPath volume that mounts /var/lib/cuda-fault-test (host) to /host-fault (pod); see the volume sketch after this list
    • Uses DirectoryOrCreate to ensure the directory exists on the node
    • Introduces a passthrough_mode parameter: deploys the library with CUDA_FAULT_INJECTION_ENABLED=0, allowing baseline testing before fault injection
    • Sets an aggressive update strategy (maxUnavailable=100%, maxSurge=0) so that all pods update simultaneously when faults are enabled
    • Force-deletes pods when enabling to apply changes immediately (a one-time operation)
  3. cuda_fault_injection.py - New helper methods (a kubectl exec sketch follows this list)
    • check_if_cuda_library_deployed(): Detects whether the CUDA library is already injected (checks LD_PRELOAD, init containers, volumes)
    • enable_cuda_faults_via_toggle(): Writes "1" to /host-fault/cuda_fault_enabled in running pods via kubectl exec (no restart needed)
    • disable_cuda_faults_via_toggle(): Writes "0" to the toggle file for instant recovery
    • cleanup_node_fault_markers(): Removes /host-fault/cuda_fault_enabled from nodes (cleanup after tests)
    • verify_env_var_set(): Validates environment variable propagation to the deployment spec
  4. README.md - Updated documentation
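
The three sketches below illustrate the mechanisms described above. First, for item 1, a minimal Python rendering of the lookup order only; the actual check lives in C inside get_fault_config() and runs on every intercepted CUDA call. The paths, toggle values, and env var name are taken from this description.

```python
import os
from pathlib import Path

def fault_injection_enabled() -> bool:
    """Illustration of the 3-tier fallback order (not the C implementation)."""
    # Tier 1: hostPath-backed marker, survives pod crashes/restarts on the same node.
    # Tier 2: pod-local /tmp marker, lost when the pod is rescheduled elsewhere.
    for marker in (Path("/host-fault/cuda_fault_enabled"),
                   Path("/tmp/cuda_fault_enabled")):
        if marker.exists():
            return marker.read_text().strip() == "1"
    # Tier 3: legacy env var, consulted only when neither marker file exists.
    return os.environ.get("CUDA_FAULT_INJECTION_ENABLED", "0") == "1"
```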
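
For item 2, a hedged sketch of the hostPath wiring; the volume name, paths, and DirectoryOrCreate type come from the description above, while the dict-literal form is just one plausible way inject_into_pods.py could express the spec fragments.

```python
# Hypothetical spec fragments; only the names and paths are taken from this PR.
node_fault_marker_volume = {
    "name": "node-fault-marker",
    "hostPath": {
        "path": "/var/lib/cuda-fault-test",  # marker directory on the node
        "type": "DirectoryOrCreate",         # created on the node if missing
    },
}

node_fault_marker_mount = {
    "name": "node-fault-marker",
    "mountPath": "/host-fault",  # tier-1 location checked by cuda_intercept.c
    "readOnly": False,           # pods must be able to write the toggle file
}
```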
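
For item 3, a hedged sketch of how a toggle helper can write the marker through kubectl exec; the method names in the list above are the PR's, but this standalone function, its signature, and the exact command line are assumptions.

```python
import subprocess

def set_cuda_fault_toggle(pod: str, namespace: str, enabled: bool) -> None:
    """Flip the fault toggle in a running pod without restarting it (sketch)."""
    value = "1" if enabled else "0"
    subprocess.run(
        [
            "kubectl", "exec", "-n", namespace, pod, "--",
            "sh", "-c", f"echo {value} > /host-fault/cuda_fault_enabled",
        ],
        check=True,
    )
```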

Architecture Context:

  • Phase 0: Deploy the library in passthrough mode (baseline testing)
  • Phase 1: Toggle faults ON via the filesystem → pods crash naturally
  • Phase 2: Toggle faults OFF via the filesystem → pods recover on restart
  • No forced deletions or restarts are needed after the initial setup
  • hostPath persistence simulates a real hardware failure that persists until the pod reschedules (a usage sketch of these phases follows this list)
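
A hypothetical walk-through of these phases; the helper names are the ones introduced in this PR, but the single injector object and the deploy(passthrough_mode=True) call are placeholders for whatever entry point performs the initial library deployment.

```python
def run_fault_injection_phases(injector) -> None:
    """Drive Phases 0-2 with the helpers named above (sketch only)."""
    # Phase 0: deploy the intercept library in passthrough mode for a baseline run,
    # unless a previous run already injected it.
    if not injector.check_if_cuda_library_deployed():
        injector.deploy(passthrough_mode=True)  # placeholder deploy entry point

    # Phase 1: flip faults ON in the running pods; subsequent CUDA calls fail
    # and the pods crash naturally.
    injector.enable_cuda_faults_via_toggle()

    # ... exercise the fault-tolerance recovery scenario under test here ...

    # Phase 2: flip faults OFF; pods recover the next time they restart.
    injector.disable_cuda_faults_via_toggle()

    # Teardown: remove the host-side marker files so later tests start clean.
    injector.cleanup_node_fault_markers()
```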

Where should the reviewer start?

  1. Start with the cuda_intercept.c changes (lines 62-130):
    • Review the modified get_fault_config() function
    • Understand the 3-tier fallback mechanism (hostPath → /tmp → env var)
    • Ensure it is checked on every CUDA call for instant toggling
  2. Next, review inject_into_pods.py (lines 203-215, 268-278):
    • Examine the hostPath volume definition (/var/lib/cuda-fault-test)
    • Check how it is mounted to /host-fault with write permissions
    • Review the passthrough_mode parameter introduction (line 348)
  3. Then check the new methods in cuda_fault_injection.py:
    • enable_cuda_faults_via_toggle() - see how it uses kubectl exec to write the toggle file
    • disable_cuda_faults_via_toggle() - same mechanism for recovery
    • cleanup_node_fault_markers() - cleanup logic for test teardown

Related Issues:

Builds on PR #4038

… cuda fault injections w/o requiring restarts or force deletion (besides initial setup)

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>
copy-pr-bot (bot) commented Dec 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

