Commit d50a27d

style: apply black, clang-format, and ruff formatting fixes

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>
1 parent fc60cf4 commit d50a27d

File tree

75 files changed, +19873 -68 lines changed

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
[TEST] 2025-10-20T14:02:57 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Simulating GPU failure on node aks-a100b-22138447-vmss000001
[TEST] 2025-10-20T14:02:57 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Labeled node aks-a100b-22138447-vmss000001 with gpu-health=failed
[TEST] 2025-10-20T14:02:57 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Cordoning node aks-a100b-22138447-vmss000001
[TEST] 2025-10-20T14:03:06 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Waiting for pod 5e945c5c-6ad5-4420-ba54-0322db4fe06b to be rescheduled (timeout=300s)
[TEST] 2025-10-20T14:08:11 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Uncordoning node aks-a100b-22138447-vmss000001

test_network_partition_frontend_to_worker_with_recovery/test.log.txt

Whitespace-only changes.

test_network_partition_worker_to_nats_with_recovery/test.log.txt

Whitespace-only changes.
Lines changed: 293 additions & 0 deletions
@@ -0,0 +1,293 @@
# Per-GPU Process Tracking (Without MPU)

## Problem

**nvidia-smi inside containers doesn't show process lists** due to PID namespace isolation. This is exactly the problem MPU (https://github.com/matpool/mpu) solves, but:

- MPU requires compiling a kernel module
- Your cluster runs **kernel 5.15.0-1092-azure** (above the 5.7.7 threshold)
- MPU may not work on this kernel version
- Each node has **8 GPUs**, so per-GPU (0-7) process visibility is needed

## Solution Architecture

**Run monitoring on the HOST (not in containers):**

```
┌──────────────────────────────────────┐
│ Node (8 GPUs: 0-7)                   │
├──────────────────────────────────────┤
│                                      │
│  ┌────────────────────────────────┐  │
│  │ DaemonSet Pod (hostPID=true)   │  │
│  │ ├─ nvidia-smi on HOST          │  │
│  │ │  └─ Gets GPU 0-7 assignments │  │
│  │ │     with HOST PIDs           │  │
│  │ │                              │  │
│  │ ├─ Read /proc/<pid>/cgroup     │  │
│  │ │  └─ Map PID → Container ID   │  │
│  │ │                              │  │
│  │ └─ Query Kubernetes API        │  │
│  │    └─ Map Container → Pod      │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌───────┐ ┌───────┐ ┌───────┐       │
│  │ Pod A │ │ Pod B │ │ Pod C │       │
│  │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ ...   │
│  └───────┘ └───────┘ └───────┘       │
└──────────────────────────────────────┘
```
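
The host-side mapping is the core of this design: nvidia-smi on the host reports host PIDs, `/proc/<pid>/cgroup` ties each PID to a container ID, and the Kubernetes API ties the container to a pod. The sketch below illustrates that chain; it assumes a containerd-based node and the official `kubernetes` Python client, and the helper names are illustrative rather than the tracker's actual API.

```python
# Sketch of the host-side mapping chain; not the tracker's actual code.
# Assumes a containerd node (64-hex container IDs in /proc/<pid>/cgroup)
# and the official `kubernetes` Python client.
import re
import subprocess
from typing import List, Optional

from kubernetes import client, config


def gpu_pids_on_host() -> List[int]:
    """Host PIDs of GPU compute processes, as seen by nvidia-smi on the host."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        text=True,
    )
    return [int(line.strip()) for line in out.splitlines() if line.strip()]


def container_id_for_pid(pid: int) -> Optional[str]:
    """Pull a container ID out of /proc/<pid>/cgroup (containerd layout assumed)."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            cgroup = f.read()
    except FileNotFoundError:
        return None
    match = re.search(r"([0-9a-f]{64})", cgroup)
    return match.group(1) if match else None


def pod_for_container_id(container_id: str, namespace: str) -> Optional[str]:
    """Map a container ID back to its pod via the Kubernetes API."""
    config.load_incluster_config()  # uses the DaemonSet pod's ServiceAccount
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            # status.container_id looks like "containerd://<64-hex-id>"
            if status.container_id and container_id in status.container_id:
                return pod.metadata.name
    return None
```

Because the DaemonSet runs with `hostPID: true`, the PIDs nvidia-smi reports are host PIDs, which is what makes the `/proc/<pid>/cgroup` lookup possible in the first place.
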
## Files

- **`examples/gpu_process_tracker_dcgm.py`** - Per-GPU tracker library
- **`deploy/gpu-process-tracker-daemonset.yaml`** - DaemonSet deployment
- **`deploy/deploy_gpu_tracker.sh`** - Deployment script

## Quick Start

### 1. Deploy Tracker DaemonSet

```bash
cd dynamo/tests/fault_tolerance/hardware/fault-injection-service/deploy
./deploy_gpu_tracker.sh
```

This deploys a DaemonSet on all GPU nodes with `hostPID` access.
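
To sanity-check the rollout from a script rather than with `kubectl`, something like the following works (a sketch using the `kubernetes` Python client; the DaemonSet name and namespace follow the examples in this README and may differ in your deployment):

```python
# Sketch: verify the tracker DaemonSet is fully scheduled and runs with hostPID.
# "gpu-process-tracker" / "dynamo-oviya" follow this README's examples.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
apps = client.AppsV1Api()

ds = apps.read_namespaced_daemon_set("gpu-process-tracker", "dynamo-oviya")
print(f"scheduled={ds.status.desired_number_scheduled} ready={ds.status.number_ready}")
assert ds.spec.template.spec.host_pid, "tracker pods must run with hostPID: true"
```
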
### 2. Query GPU Processes

**See all processes on a node:**
```bash
# Get pod on specific node
NODE_POD=$(kubectl get pods -n dynamo-oviya -l app=gpu-process-tracker \
  --field-selector spec.nodeName=aks-a100a-36888584-vmss000000 \
  -o name | head -1)

# Query all GPU processes
kubectl exec -n dynamo-oviya $NODE_POD -- python3 /usr/local/bin/gpu_tracker.py
```

**Output:**
```
Node: aks-a100a-36888584-vmss000000
Total GPU Processes: 8
================================================================================

GPU 0: 1 process(es)
--------------------------------------------------------------------------------
  PID 12345: python3
    Memory: 2048 MB
    UUID: GPU-abc123...
    Pod: dynamo-oviya/vllm-worker-0

GPU 1: 1 process(es)
--------------------------------------------------------------------------------
  PID 12346: python3
    Memory: 2048 MB
    UUID: GPU-def456...
    Pod: dynamo-oviya/vllm-worker-1

...
```

**See only GPU 3:**
```bash
kubectl exec -n dynamo-oviya $NODE_POD -- \
  python3 /usr/local/bin/gpu_tracker.py --gpu 3
```

**Get JSON output:**
```bash
kubectl exec -n dynamo-oviya $NODE_POD -- \
  python3 /usr/local/bin/gpu_tracker.py --format json
```
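
The JSON form is the easiest one to consume from a test harness. Below is a sketch only; the field names are assumed to mirror the `GPUProcess` fields documented later in this README, so adjust them to whatever `gpu_tracker.py` actually emits, and `NODE_POD` is an example value.

```python
# Sketch: read per-GPU process info as JSON from the tracker pod.
# NODE_POD is an example name; the field names are assumptions based on the
# GPUProcess dataclass documented below.
import json
import subprocess

NODE_POD = "pod/gpu-process-tracker-abc12"  # e.g. the value from the query above

raw = subprocess.check_output(
    [
        "kubectl", "exec", "-n", "dynamo-oviya", NODE_POD, "--",
        "python3", "/usr/local/bin/gpu_tracker.py", "--format", "json",
    ],
    text=True,
)

for proc in json.loads(raw):
    print(proc.get("gpu_index"), proc.get("pid"), proc.get("pod_name"))
```
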
### 3. Use in Your Code

**Python API (for fault injection tests):**

```python
from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu

# Get processes organized by GPU (0-7)
processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace="dynamo-oviya")

# Example: find which GPU holds expert 42
for gpu_idx, processes in processes_by_gpu.items():
    for proc in processes:
        if proc.experts and 42 in proc.experts:
            print(f"Expert 42 is on GPU {gpu_idx} in pod {proc.pod_name}")

            # Inject a fault on this specific GPU
            # (see the full fault-injection example below)
            inject_xid_on_gpu(
                node=proc.node_name,
                gpu_index=gpu_idx,
                xid_type="79",
            )
```

**Integration with existing `map_gpu_to_experts.py`:**

```python
# Replace the existing function
def get_vllm_gpu_processes(namespace: str = "dynamo-oviya") -> List[GPUProcess]:
    """Enhanced version that works on kernel 5.15.0"""
    from gpu_process_tracker_dcgm import get_vllm_gpu_processes_enhanced
    return get_vllm_gpu_processes_enhanced(namespace)
```

## Per-GPU Inventory Object

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GPUNodeInventory:
    node_name: str
    gpus: Dict[int, List["GPUProcess"]]  # GPU 0-7 -> processes

    def get_gpu_processes(self, gpu_index: int) -> List["GPUProcess"]: ...
    def total_processes(self) -> int: ...


@dataclass
class GPUProcess:
    pid: int              # Host PID
    gpu_index: int        # GPU 0-7 on this node
    gpu_uuid: str
    memory_mb: int
    process_name: str
    pod_name: str         # Kubernetes pod
    pod_namespace: str
    node_name: str
    dp_rank: int          # Data parallel rank
    experts: List[int]    # MoE expert IDs
```
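
For orientation, the same per-GPU layout can be walked directly via the helper used earlier; this mirrors the `gpus` mapping on `GPUNodeInventory` (a usage sketch, not part of the library itself):

```python
# Sketch: iterate the per-GPU process layout returned by the documented helper.
from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu

processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace="dynamo-oviya")

total = sum(len(procs) for procs in processes_by_gpu.values())
print(f"total GPU processes: {total}")

for gpu_index in range(8):  # GPUs 0-7 on each node
    for proc in processes_by_gpu.get(gpu_index, []):
        print(f"GPU {gpu_index}: pid={proc.pid} "
              f"pod={proc.pod_namespace}/{proc.pod_name} experts={proc.experts}")
```
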
## Example: Target Specific GPU for Fault Injection

```python
#!/usr/bin/env python3
"""
Example: Inject XID fault on GPU running specific expert
"""

from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu
from cuda_fault_injection import inject_xid


def inject_fault_on_expert(target_expert_id: int, namespace: str = "dynamo-oviya"):
    """Inject GPU fault on the GPU running specific expert"""

    # Get per-GPU process inventory
    processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace)

    # Find target GPU
    target_gpu = None
    target_node = None
    target_pod = None

    for gpu_idx, processes in processes_by_gpu.items():
        for proc in processes:
            if proc.experts and target_expert_id in proc.experts:
                target_gpu = gpu_idx
                target_node = proc.node_name
                target_pod = proc.pod_name
                print(f"Found expert {target_expert_id}:")
                print(f"  Node: {target_node}")
                print(f"  GPU: {target_gpu}")
                print(f"  Pod: {target_pod}")
                break
        if target_gpu is not None:
            break

    if target_gpu is None:
        raise ValueError(f"Expert {target_expert_id} not found")

    # Inject XID on specific GPU
    print(f"\nInjecting XID 79 on GPU {target_gpu}...")
    inject_xid(
        node_name=target_node,
        gpu_index=target_gpu,
        xid_type="79",
        namespace=namespace,
    )

    print("Fault injected successfully")

    return target_node, target_gpu, target_pod


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: inject_fault_on_expert.py <expert_id>")
        sys.exit(1)

    expert_id = int(sys.argv[1])
    inject_fault_on_expert(expert_id)
```

## Comparison: MPU vs This Solution

| Feature | MPU | This Solution |
|---------|-----|---------------|
| Kernel module | ✅ Required | ❌ Not needed |
| Kernel 5.15 support | ❌ Uncertain | ✅ Works |
| Per-GPU tracking | ✅ Yes (0-7) | ✅ Yes (0-7) |
| hostPID required | ❌ No | ✅ Yes (DaemonSet only) |
| Works in containers | ✅ Yes | ⚠️ No (needs host access) |
| Maintenance | ⚠️ Per-kernel rebuild | ✅ No maintenance |
| Security | ✅ No privileged access | ⚠️ DaemonSet needs privileges |

## When to Use Each

**Use MPU if:**
- Kernel < 5.7.7 or tested version (4.15, 4.19, 5.14.0-404)
- Need nvidia-smi to work inside containers
- Can maintain kernel modules

**Use This Solution if:**
- Kernel 5.15+ (like your AKS cluster)
- Can run DaemonSet with hostPID
- Want zero-maintenance solution
- Already have DCGM in stack

## Troubleshooting

**Q: DaemonSet pods not starting?**
```bash
# Check node selector matches your GPU nodes
kubectl get nodes --show-labels | grep nvidia

# Update DaemonSet nodeSelector if needed
kubectl edit daemonset gpu-process-tracker -n dynamo-oviya
```

**Q: No processes showing up?**
```bash
# Verify nvidia-smi works on host
kubectl exec -n dynamo-oviya $NODE_POD -- nvidia-smi

# Check if pods are using GPUs
kubectl exec -n dynamo-oviya $NODE_POD -- \
  nvidia-smi --query-compute-apps=pid,process_name --format=csv
```

**Q: Pod metadata not enriched?**
```bash
# Verify Kubernetes API access
kubectl exec -n dynamo-oviya $NODE_POD -- \
  kubectl get pods -n dynamo-oviya

# Check ServiceAccount permissions
kubectl describe sa default -n dynamo-oviya
```
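
If the ServiceAccount looks right but enrichment still fails, a direct permission probe can help. The sketch below uses the Kubernetes `SelfSubjectAccessReview` API via the Python client; run it with the tracker pod's ServiceAccount.

```python
# Sketch: check whether the tracker's ServiceAccount may list pods in the
# namespace, using a SelfSubjectAccessReview.
from kubernetes import client, config

config.load_incluster_config()  # uses the pod's own ServiceAccount token
review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="dynamo-oviya", verb="list", resource="pods"
        )
    )
)
result = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("can list pods:", result.status.allowed)
```
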
## Future Enhancements

- [ ] Prometheus metrics export (per-GPU utilization)
- [ ] gRPC API for real-time queries
- [ ] Expert routing map integration
- [ ] Automatic fault injection targets
- [ ] Historical tracking (time-series DB)

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
__pycache__
*.pyc
*.pyo
*.pyd
.Python
*.so
.git
.gitignore
*.md
README.md
.venv
venv
env
*.egg-info
.pytest_cache
.coverage
htmlcov
*.log
deploy/
examples/
client/

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/
.venv

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Testing
.pytest_cache/
.coverage
htmlcov/
*.log

# Docker
*.tar
*.tar.gz

# Kubernetes
*.kubeconfig
