
Conversation

@nv-oviya (Contributor) commented Dec 2, 2025

Overview:

Implements runtime fault injection control via filesystem toggles and hostPath volumes, eliminating the need for pod restarts when enabling/disabling CUDA faults (except for initial library deployment).

Previously, enabling/disabling CUDA fault injection required modifying environment variables and restarting pods, which is unrealistic for testing fault-tolerance recovery scenarios. This PR introduces a 3-tier toggling system that allows instant fault activation/deactivation in running pods.


Details:

  1. cuda_intercept.c - Runtime toggle detection
    • Modifies get_fault_config() to check the fault injection status on every CUDA call rather than once at initialization
    • Implements a 3-tier fallback: /host-fault/cuda_fault_enabled (persistent) → /tmp/cuda_fault_enabled (ephemeral) → CUDA_FAULT_INJECTION_ENABLED env var (a lookup-order sketch follows this list)
    • The hostPath check ensures faults persist across pod crashes/restarts on the same node, accurately simulating a persistent hardware failure
    • When a pod reschedules to a different node → no fault marker file → automatic recovery
  2. inject_into_pods.py - Persistent volume infrastructure
    • Adds a node-fault-marker hostPath volume that mounts /var/lib/cuda-fault-test (host) to /host-fault (pod); see the volume sketch after this list
    • Uses DirectoryOrCreate to ensure the directory exists on the node
    • Introduces a passthrough_mode parameter: deploys the library with CUDA_FAULT_INJECTION_ENABLED=0, allowing baseline testing before fault injection
    • Sets an aggressive update strategy (maxUnavailable=100%, maxSurge=0) so that all pods update simultaneously when faults are enabled
    • Force-deletes pods when enabling to apply changes immediately (a one-time operation)
  3. cuda_fault_injection.py - New helper methods (a kubectl exec sketch follows this list)
    • check_if_cuda_library_deployed(): Detects whether the CUDA library is already injected (checks LD_PRELOAD, init containers, volumes)
    • enable_cuda_faults_via_toggle(): Writes "1" to /host-fault/cuda_fault_enabled in running pods via kubectl exec (no restart needed)
    • disable_cuda_faults_via_toggle(): Writes "0" to the toggle file for instant recovery
    • cleanup_node_fault_markers(): Removes /host-fault/cuda_fault_enabled from nodes (cleanup after tests)
    • verify_env_var_set(): Validates environment variable propagation to the deployment spec
  4. README.md - Updated documentation
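
The three sketches below illustrate the mechanisms described above. First, for item 1, a minimal Python rendering of the lookup order only; the actual check lives in C inside get_fault_config() and runs on every intercepted CUDA call. The paths, toggle values, and env var name are taken from this description.

```python
import os
from pathlib import Path

def fault_injection_enabled() -> bool:
    """Illustration of the 3-tier fallback order (not the C implementation)."""
    # Tier 1: hostPath-backed marker, survives pod crashes/restarts on the same node.
    # Tier 2: pod-local /tmp marker, lost when the pod is rescheduled elsewhere.
    for marker in (Path("/host-fault/cuda_fault_enabled"),
                   Path("/tmp/cuda_fault_enabled")):
        if marker.exists():
            return marker.read_text().strip() == "1"
    # Tier 3: legacy env var, consulted only when neither marker file exists.
    return os.environ.get("CUDA_FAULT_INJECTION_ENABLED", "0") == "1"
```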
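
For item 2, a hedged sketch of the hostPath wiring; the volume name, paths, and DirectoryOrCreate type come from the description above, while the dict-literal form is just one plausible way inject_into_pods.py could express the spec fragments.

```python
# Hypothetical spec fragments; only the names and paths are taken from this PR.
node_fault_marker_volume = {
    "name": "node-fault-marker",
    "hostPath": {
        "path": "/var/lib/cuda-fault-test",  # marker directory on the node
        "type": "DirectoryOrCreate",         # created on the node if missing
    },
}

node_fault_marker_mount = {
    "name": "node-fault-marker",
    "mountPath": "/host-fault",  # tier-1 location checked by cuda_intercept.c
    "readOnly": False,           # pods must be able to write the toggle file
}
```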
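
For item 3, a hedged sketch of how a toggle helper can write the marker through kubectl exec; the method names in the list above are the PR's, but this standalone function, its signature, and the exact command line are assumptions.

```python
import subprocess

def set_cuda_fault_toggle(pod: str, namespace: str, enabled: bool) -> None:
    """Flip the fault toggle in a running pod without restarting it (sketch)."""
    value = "1" if enabled else "0"
    subprocess.run(
        [
            "kubectl", "exec", "-n", namespace, pod, "--",
            "sh", "-c", f"echo {value} > /host-fault/cuda_fault_enabled",
        ],
        check=True,
    )
```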

Architecture Context:

  • Phase 0: Deploy the library in passthrough mode (baseline testing)
  • Phase 1: Toggle faults ON via the filesystem → pods crash naturally
  • Phase 2: Toggle faults OFF via the filesystem → pods recover on restart
  • No forced deletions or restarts are needed after the initial setup
  • hostPath persistence simulates a real hardware failure that persists until the pod reschedules (a usage sketch of these phases follows this list)
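
A hypothetical walk-through of these phases; the helper names are the ones introduced in this PR, but the single injector object and the deploy(passthrough_mode=True) call are placeholders for whatever entry point performs the initial library deployment.

```python
def run_fault_injection_phases(injector) -> None:
    """Drive Phases 0-2 with the helpers named above (sketch only)."""
    # Phase 0: deploy the intercept library in passthrough mode for a baseline run,
    # unless a previous run already injected it.
    if not injector.check_if_cuda_library_deployed():
        injector.deploy(passthrough_mode=True)  # placeholder deploy entry point

    # Phase 1: flip faults ON in the running pods; subsequent CUDA calls fail
    # and the pods crash naturally.
    injector.enable_cuda_faults_via_toggle()

    # ... exercise the fault-tolerance recovery scenario under test here ...

    # Phase 2: flip faults OFF; pods recover the next time they restart.
    injector.disable_cuda_faults_via_toggle()

    # Teardown: remove the host-side marker files so later tests start clean.
    injector.cleanup_node_fault_markers()
```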

Where should the reviewer start?

  1. Start with the cuda_intercept.c changes (lines 62-130):
    • Review the modified get_fault_config() function
    • Understand the 3-tier fallback mechanism (hostPath → /tmp → env var)
    • Ensure it is checked on every CUDA call for instant toggling
  2. Next, review inject_into_pods.py (lines 203-215, 268-278):
    • Examine the hostPath volume definition (/var/lib/cuda-fault-test)
    • Check how it is mounted to /host-fault with write permissions
    • Review the passthrough_mode parameter introduction (line 348)
  3. Then check the new methods in cuda_fault_injection.py:
    • enable_cuda_faults_via_toggle() - see how it uses kubectl exec to write the toggle file
    • disable_cuda_faults_via_toggle() - same mechanism for recovery
    • cleanup_node_fault_markers() - cleanup logic for test teardown

Related Issues:

Builds on PR #4038

… cuda fault injections w/o requiring restarts or force deletion (besides initial setup)

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>
copy-pr-bot (bot) commented Dec 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

