# Per-GPU Process Tracking (Without MPU)

## Problem

**nvidia-smi inside containers doesn't show process lists** because of PID namespace isolation: the driver reports host PIDs, which nvidia-smi can't resolve inside the container's PID namespace. This is exactly the problem MPU (https://github.com/matpool/mpu) solves, but:

- MPU requires compiling a kernel module
- Your cluster runs **kernel 5.15.0-1092-azure**, above MPU's 5.7.7 threshold
- MPU may not work on this kernel version
- Each node has **8 GPUs**, so we need per-GPU (0-7) process visibility

## Solution Architecture

**Run monitoring on the HOST (not in containers):**

```
┌─────────────────────────────────────────────────────┐
│ Node (8 GPUs: 0-7)                                  │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌─────────────────────────────────┐                │
│  │ DaemonSet Pod (hostPID=true)    │                │
│  │ ├─ nvidia-smi on HOST           │                │
│  │ │  └─ Gets GPU 0-7 assignments  │                │
│  │ │     with HOST PIDs            │                │
│  │ │                               │                │
│  │ ├─ Read /proc/<pid>/cgroup      │                │
│  │ │  └─ Map PID → Container ID    │                │
│  │ │                               │                │
│  │ └─ Query Kubernetes API         │                │
│  │    └─ Map Container → Pod       │                │
│  └─────────────────────────────────┘                │
│                                                     │
│  ┌───────┐  ┌───────┐  ┌───────┐                    │
│  │Pod A  │  │Pod B  │  │Pod C  │                    │
│  │GPU 0  │  │GPU 1  │  │GPU 2  │  ...               │
│  └───────┘  └───────┘  └───────┘                    │
└─────────────────────────────────────────────────────┘
```
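
The heart of this pipeline is the PID-to-pod mapping. Below is a minimal illustrative sketch of the three steps; the function names are hypothetical (the real implementation lives in `gpu_process_tracker_dcgm.py`), and it assumes containerd/Docker-style cgroup paths that embed a 64-character hex container ID:

```python
# Illustrative sketch only -- function names are hypothetical; the real
# implementation is in examples/gpu_process_tracker_dcgm.py.
import re
import subprocess
from typing import Dict, List, Optional, Tuple


def host_gpu_pids() -> List[Tuple[int, int]]:
    """Step 1: collect (gpu_index, host_pid) pairs from nvidia-smi on the host."""
    # Compute-apps rows report GPU UUIDs, so first build UUID -> index.
    uuid_to_index: Dict[str, int] = {}
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        text=True)
    for line in out.strip().splitlines():
        idx, uuid = [f.strip() for f in line.split(",")]
        uuid_to_index[uuid] = int(idx)

    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=gpu_uuid,pid",
         "--format=csv,noheader"], text=True)
    pairs = []
    for line in out.strip().splitlines():
        uuid, pid = [f.strip() for f in line.split(",")]
        pairs.append((uuid_to_index[uuid], int(pid)))
    return pairs


def container_id_for_pid(pid: int) -> Optional[str]:
    """Step 2: pull a 64-hex container ID out of /proc/<pid>/cgroup."""
    try:
        text = open(f"/proc/{pid}/cgroup").read()
    except OSError:
        return None  # process exited between steps, or unreadable
    m = re.search(r"([0-9a-f]{64})", text)
    return m.group(1) if m else None


def pod_for_container(container_id: str, pods: List[dict]) -> Optional[str]:
    """Step 3: match the container ID against containerStatuses from the
    Kubernetes API. `pods` is the items list of a pod query, e.g. parsed
    from `kubectl get pods -A -o json`."""
    for pod in pods:
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # containerID looks like "containerd://<64-hex-id>"
            if container_id in cs.get("containerID", ""):
                return f"{pod['metadata']['namespace']}/{pod['metadata']['name']}"
    return None
```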

## Files

- **`examples/gpu_process_tracker_dcgm.py`** - Per-GPU tracker library
- **`deploy/gpu-process-tracker-daemonset.yaml`** - DaemonSet deployment
- **`deploy/deploy_gpu_tracker.sh`** - Deployment script

## Quick Start

### 1. Deploy Tracker DaemonSet

```bash
cd dynamo/tests/fault_tolerance/hardware/fault-injection-service/deploy
./deploy_gpu_tracker.sh
```

This deploys a DaemonSet with `hostPID` access on all GPU nodes.

### 2. Query GPU Processes

**See all processes on a node:**
```bash
# Get the tracker pod on a specific node
NODE_POD=$(kubectl get pods -n dynamo-oviya -l app=gpu-process-tracker \
    --field-selector spec.nodeName=aks-a100a-36888584-vmss000000 \
    -o name | head -1)

# Query all GPU processes
kubectl exec -n dynamo-oviya $NODE_POD -- python3 /usr/local/bin/gpu_tracker.py
```

**Output:**
```
Node: aks-a100a-36888584-vmss000000
Total GPU Processes: 8
================================================================================

GPU 0: 1 process(es)
--------------------------------------------------------------------------------
  PID 12345: python3
    Memory: 2048 MB
    UUID: GPU-abc123...
    Pod: dynamo-oviya/vllm-worker-0

GPU 1: 1 process(es)
--------------------------------------------------------------------------------
  PID 12346: python3
    Memory: 2048 MB
    UUID: GPU-def456...
    Pod: dynamo-oviya/vllm-worker-1

...
```

**See only GPU 3:**
```bash
kubectl exec -n dynamo-oviya $NODE_POD -- \
    python3 /usr/local/bin/gpu_tracker.py --gpu 3
```

**Get JSON output:**
```bash
kubectl exec -n dynamo-oviya $NODE_POD -- \
    python3 /usr/local/bin/gpu_tracker.py --format json
```
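
The JSON form is convenient for test harnesses. A minimal consumer sketch, assuming the JSON schema mirrors the per-process fields shown in the text output above (the exact schema is defined by `gpu_tracker.py`):

```python
import json
import subprocess


def query_node_processes(tracker_pod: str, namespace: str = "dynamo-oviya"):
    """Run the tracker in a DaemonSet pod and parse its JSON output."""
    out = subprocess.check_output(
        ["kubectl", "exec", "-n", namespace, tracker_pod, "--",
         "python3", "/usr/local/bin/gpu_tracker.py", "--format", "json"],
        text=True,
    )
    return json.loads(out)
```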

### 3. Use in Your Code

**Python API (for fault injection tests):**

```python
from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu

# Get processes organized by GPU (0-7)
processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace="dynamo-oviya")

# Example: find which GPU hosts expert 42
for gpu_idx, processes in processes_by_gpu.items():
    for proc in processes:
        if proc.experts and 42 in proc.experts:
            print(f"Expert 42 is on GPU {gpu_idx} in pod {proc.pod_name}")

            # Inject a fault on this specific GPU
            # (inject_xid_on_gpu is an illustrative helper; see the full
            # fault injection example below)
            inject_xid_on_gpu(
                node=proc.node_name,
                gpu_index=gpu_idx,
                xid_type="79",
            )
```

**Integration with existing `map_gpu_to_experts.py`:**

```python
# In map_gpu_to_experts.py, replace the existing function:
def get_vllm_gpu_processes(namespace: str = "dynamo-oviya") -> List[GPUProcess]:
    """Enhanced version that works on kernel 5.15.0."""
    from gpu_process_tracker_dcgm import get_vllm_gpu_processes_enhanced
    return get_vllm_gpu_processes_enhanced(namespace)
```
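
The signature and return type are unchanged, so existing call sites in the expert-mapping code keep working without edits.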

## Per-GPU Inventory Object

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GPUNodeInventory:
    node_name: str
    gpus: Dict[int, List[GPUProcess]]  # GPU index (0-7) -> processes

    def get_gpu_processes(self, gpu_index: int) -> List[GPUProcess]:
        return self.gpus.get(gpu_index, [])

    def total_processes(self) -> int:
        return sum(len(procs) for procs in self.gpus.values())


@dataclass
class GPUProcess:
    pid: int              # Host PID
    gpu_index: int        # GPU 0-7 on this node
    gpu_uuid: str
    memory_mb: int
    process_name: str
    pod_name: str         # Kubernetes pod
    pod_namespace: str
    node_name: str
    dp_rank: int          # Data parallel rank
    experts: List[int]    # MoE expert IDs
```
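
A quick usage sketch (the values are placeholders; in practice the tracker builds the inventory for you):

```python
# Hypothetical usage of the inventory object defined above.
inventory = GPUNodeInventory(
    node_name="aks-a100a-36888584-vmss000000",
    gpus={0: [], 3: []},  # normally populated with GPUProcess entries
)
print(inventory.total_processes())      # 0
print(inventory.get_gpu_processes(3))   # [] until the tracker fills it in
```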

## Example: Target Specific GPU for Fault Injection

```python
#!/usr/bin/env python3
"""Inject an XID fault on the GPU running a specific MoE expert."""

from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu
from cuda_fault_injection import inject_xid


def inject_fault_on_expert(target_expert_id: int, namespace: str = "dynamo-oviya"):
    """Inject a GPU fault on the GPU running the given expert."""

    # Get per-GPU process inventory
    processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace)

    # Find the target GPU
    target_gpu = None
    target_node = None
    target_pod = None

    for gpu_idx, processes in processes_by_gpu.items():
        for proc in processes:
            if proc.experts and target_expert_id in proc.experts:
                target_gpu = gpu_idx
                target_node = proc.node_name
                target_pod = proc.pod_name
                print(f"Found expert {target_expert_id}:")
                print(f"  Node: {target_node}")
                print(f"  GPU:  {target_gpu}")
                print(f"  Pod:  {target_pod}")
                break
        if target_gpu is not None:
            break

    if target_gpu is None:
        raise ValueError(f"Expert {target_expert_id} not found")

    # Inject XID on the specific GPU
    print(f"\nInjecting XID 79 on GPU {target_gpu}...")
    inject_xid(
        node_name=target_node,
        gpu_index=target_gpu,
        xid_type="79",
        namespace=namespace,
    )

    print("Fault injected successfully")

    return target_node, target_gpu, target_pod


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: inject_fault_on_expert.py <expert_id>")
        sys.exit(1)

    expert_id = int(sys.argv[1])
    inject_fault_on_expert(expert_id)
```
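
Invoke it with the target expert ID, e.g. `python3 inject_fault_on_expert.py 42`; it prints where the expert was found and then triggers the injection.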

## Comparison: MPU vs This Solution

| Feature | MPU | This Solution |
|---------|-----|---------------|
| Kernel module | Required | Not needed |
| Kernel 5.15 support | Uncertain | Works |
| Per-GPU tracking (0-7) | Yes | Yes |
| hostPID required | No | Yes (DaemonSet only) |
| nvidia-smi inside containers | Yes | No (queries run from the host) |
| Maintenance | Rebuild per kernel | None |
| Privileged access | Not needed | DaemonSet needs privileges |

## When to Use Each

**Use MPU if:**
- Kernel < 5.7.7, or one of MPU's tested versions (4.15, 4.19, 5.14.0-404)
- You need nvidia-smi to show processes inside containers
- You can maintain kernel modules

**Use This Solution if:**
- Kernel 5.15+ (like your AKS cluster)
- You can run a DaemonSet with hostPID
- You want a zero-maintenance solution
- You already have DCGM in your stack

## Troubleshooting

**Q: DaemonSet pods not starting?**
```bash
# Check that the node selector matches your GPU nodes
kubectl get nodes --show-labels | grep nvidia

# Update the DaemonSet nodeSelector if needed
kubectl edit daemonset gpu-process-tracker -n dynamo-oviya
```

**Q: No processes showing up?**
```bash
# Verify nvidia-smi works on the host
kubectl exec -n dynamo-oviya $NODE_POD -- nvidia-smi

# Check whether pods are using GPUs
kubectl exec -n dynamo-oviya $NODE_POD -- \
    nvidia-smi --query-compute-apps=pid,process_name --format=csv
```

**Q: Pod metadata not enriched?**
```bash
# Verify Kubernetes API access from inside the tracker pod
kubectl exec -n dynamo-oviya $NODE_POD -- \
    kubectl get pods -n dynamo-oviya

# Check ServiceAccount permissions
kubectl describe sa default -n dynamo-oviya
```

## Future Enhancements

- [ ] Prometheus metrics export (per-GPU utilization)
- [ ] gRPC API for real-time queries
- [ ] Expert routing map integration
- [ ] Automatic fault injection targets
- [ ] Historical tracking (time-series DB)