feat(fault-injection): Enable runtime CUDA fault injection toggling without pod restarts #4679
Overview:
Implements runtime fault injection control via filesystem toggles and hostPath volumes, eliminating the need for pod restarts when enabling/disabling CUDA faults (except for initial library deployment).
Previously, enabling or disabling CUDA fault injection required modifying environment variables and restarting pods, which is unrealistic for testing fault-tolerance recovery scenarios. This PR introduces a three-tier toggle that allows fault injection to be activated or deactivated instantly in running pods.
Details:
cuda_intercept.c - Runtime Toggle Detection
- get_fault_config() checks fault injection status on every CUDA call rather than once at initialization
- Toggle order: /host-fault/cuda_fault_enabled (persistent) → /tmp/cuda_fault_enabled (ephemeral) → CUDA_FAULT_INJECTION_ENABLED env var

inject_into_pods.py - Persistent volume infrastructure
- hostPath volume: /var/lib/cuda-fault-test (host) to /host-fault (pod)
- CUDA_FAULT_INJECTION_ENABLED=0, allowing baseline testing before fault injection

cuda_fault_injection.py - New helper methods
- check_if_cuda_library_deployed(): Detects if the CUDA library is already injected (checks LD_PRELOAD, init containers, volumes)
- enable_cuda_faults_via_toggle(): Writes "1" to /host-fault/cuda_fault_enabled in running pods via kubectl exec (no restart needed)
- disable_cuda_faults_via_toggle(): Writes "0" to the toggle file for instant recovery
- cleanup_node_fault_markers(): Removes /host-fault/cuda_fault_enabled from nodes (cleanup after tests)
- verify_env_var_set(): Validates environment variable propagation to the deployment spec

Architecture Context:
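For reviewers, the toggle order listed under Details can be modeled in a few lines of Python. This is only an illustrative sketch, not the code in this PR: the real check lives in get_fault_config() in cuda_intercept.c. The function names read_toggle, fault_enabled, and toggle_command are hypothetical, and the sketch assumes that the first toggle file that exists decides the outcome and that a file starting with "1" means enabled. The paths and the environment variable name are the ones given above.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional, Sequence

# Paths and variable name as given in the PR description.
HOST_TOGGLE = "/host-fault/cuda_fault_enabled"  # persistent (hostPath volume)
TMP_TOGGLE = "/tmp/cuda_fault_enabled"          # ephemeral
ENV_VAR = "CUDA_FAULT_INJECTION_ENABLED"


def read_toggle(path: str) -> Optional[bool]:
    """Return True/False if the toggle file is readable, None if it is absent."""
    try:
        text = Path(path).read_text().strip()
    except (OSError, ValueError):
        return None
    # Assumption: content starting with "1" enables faults, anything else disables.
    return text.startswith("1")


def fault_enabled(
    toggles: Sequence[str] = (HOST_TOGGLE, TMP_TOGGLE),
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Resolve the toggle: the first readable file wins, then the env var."""
    for path in toggles:
        state = read_toggle(path)
        if state is not None:
            return state
    return env.get(ENV_VAR, "0") == "1"


def toggle_command(pod: str, namespace: str, enabled: bool) -> List[str]:
    """Build a kubectl exec argv that writes "1" or "0" to the persistent toggle."""
    value = "1" if enabled else "0"
    return [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "sh", "-c", f"echo {value} > {HOST_TOGGLE}",
    ]
```

Under these assumptions, a "0" in the persistent file overrides both the ephemeral file and the environment variable, which is what would let a test disable faults without touching the pod spec. The argv returned by toggle_command could be passed to subprocess.run; the pod and namespace are supplied by the caller.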
Where should the reviewer start?
Related Issues:
Builds on PR #4038