Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
# Fault Tolerance Tests

End-to-end tests for GPU failure scenarios with automated validation.

## Quick Start

```bash
# Set your deployment
export TEST_NAMESPACE="your-namespace"
export TARGET_DEPLOYMENT="your-deployment"

# Run XID 79 test (GPU fell off bus)
pytest test_xid79_minimal.py::test_xid79_cordon_drain_only -v -s
```

## Available Fixtures

### Test Fixtures

- **`xid79_test`** - GPU fell off bus (XID 79)
- **`xid74_test`** - NVLink failure (XID 74)

### Expectation Fixtures

- **`expect_cordon_and_drain`** - Node cordoned + pods evicted (5-10 min)
- **`expect_full_automation`** - Full NVSentinel automation (cordon → drain → remediate → uncordon)
- **`expect_cordon_only`** - Only cordon, no drain

## Example Test

```python
@pytest.mark.xid79
def test_xid79_cordon_drain_only(xid79_test, expect_cordon_and_drain):
"""Test GPU failure with NVSentinel cordon and drain."""
xid79_test(gpu_id=0, expect=expect_cordon_and_drain)
```

## What Gets Validated

- Fault injection (pods crash on target node)
- NVSentinel response (cordon/drain)
- Pod recovery (reschedule to healthy nodes)
- Inference recovery (service restored)
- Latency impact analysis (baseline → degraded → recovered)

## Requirements

- NVSentinel deployed with `DeleteAfterTimeout` mode
- Multi-node cluster with GPU nodes
- Target deployment with 2+ worker pods

## Troubleshooting

**Test fails with "Pods did not recover" after 15+ minutes?**

NVSentinel is likely in `AllowCompletion` mode, which never evicts crash-looping pods.

**Fix:**
```bash
kubectl edit configmap node-drainer-config -n nvsentinel
```

Change:
```toml
deleteAfterTimeoutMinutes = 5 # From 60
[[userNamespaces]]
name = "*"
mode = "DeleteAfterTimeout" # From "AllowCompletion"
```

Restart:
```bash
kubectl rollout restart deployment nvsentinel-node-drainer -n nvsentinel
```

Test will now complete in ~5-10 minutes with proper pod eviction and recovery.

Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
"""
XID 79 E2E Test - Persistent Hardware Failure Simulation

This tests realistic node-level GPU failure with persistent hardware fault:

Phase 0: Natural pod distribution across nodes (realistic!)
Phase 1: Fault injection on target node (persistent via hostPath)
- Fault marker written to /var/lib/cuda-fault-test on node (survives pod restarts)
- Pods on faulty node crash-loop indefinitely (realistic!)
- Pods on other nodes stay healthy
- Inference partially degraded

Phase 2: NVSentinel response
- Cordons faulty node
- Waits for drain/eviction (NVSentinel policy-dependent)
- Crashed pods reschedule to healthy nodes where fault marker absent
- Pods on healthy nodes recover automatically (no fault file there)

Phase 3: Recovery
- All pods on healthy nodes (no fault marker)
- Inference fully recovered
- Test cleanup removes fault marker from node

This simulates: Persistent GPU hardware failure requiring rescheduling
(not transient failures that restart-recovery can fix)

"""

import pytest


@pytest.mark.xid79
@pytest.mark.nvsentinel
@pytest.mark.slow
def test_xid79_full_automation(xid79_test, expect_full_automation):
"""XID 79 with full NVSentinel automation (detection → cordon → drain → remediate → uncordon)."""
xid79_test(gpu_id=0, expect=expect_full_automation)


@pytest.mark.xid79
@pytest.mark.nvsentinel
def test_xid79_cordon_drain_only(xid79_test, expect_cordon_and_drain):
"""XID 79 with cordon + drain (no auto-remediation)."""
xid79_test(gpu_id=0, expect=expect_cordon_and_drain)


@pytest.mark.xid79
@pytest.mark.parametrize("gpu_id", [0, 1, 2, 3])
def test_xid79_all_gpus(xid79_test, expect_cordon_and_drain, gpu_id):
"""Test XID 79 on each GPU."""
xid79_test(gpu_id=gpu_id, expect=expect_cordon_and_drain)


if __name__ == "__main__":
pytest.main([__file__, "-v", "-s", "-m", "xid79"])

Loading