Commit d50a27d

style: apply black, clang-format, and ruff formatting fixes

Signed-off-by: Oviya Seeniraj <oseeniraj@nvidia.com>
1 parent fc60cf4 commit d50a27d

File tree

75 files changed, +19873 -68 lines changed

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
[TEST] 2025-10-20T14:02:57 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Simulating GPU failure on node aks-a100b-22138447-vmss000001
[TEST] 2025-10-20T14:02:57 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Labeled node aks-a100b-22138447-vmss000001 with gpu-health=failed
[TEST] 2025-10-20T14:02:57 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Cordoning node aks-a100b-22138447-vmss000001
[TEST] 2025-10-20T14:03:06 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Waiting for pod 5e945c5c-6ad5-4420-ba54-0322db4fe06b to be rescheduled (timeout=300s)
[TEST] 2025-10-20T14:08:11 INFO tests.fault_tolerance.hardware.test_gpu_health_check: Uncordoning node aks-a100b-22138447-vmss000001

test_network_partition_frontend_to_worker_with_recovery/test.log.txt

Whitespace-only changes.

test_network_partition_worker_to_nats_with_recovery/test.log.txt

Whitespace-only changes.
Lines changed: 293 additions & 0 deletions
@@ -0,0 +1,293 @@
# Per-GPU Process Tracking (Without MPU)

## Problem

**nvidia-smi inside containers doesn't show process lists** due to PID namespace isolation. This is exactly the problem MPU (https://github.com/matpool/mpu) solves, but:

- MPU requires compiling a kernel module
- Your cluster runs **kernel 5.15.0-1092-azure** (above the 5.7.7 threshold)
- MPU may not work on this kernel version
- Each node has **8 GPUs**, so per-GPU (0-7) process visibility is needed

## Solution Architecture

**Run monitoring on the HOST (not in containers):**

```
┌──────────────────────────────────────┐
│ Node (8 GPUs: 0-7)                   │
├──────────────────────────────────────┤
│                                      │
│  ┌────────────────────────────────┐  │
│  │ DaemonSet Pod (hostPID=true)   │  │
│  │ ├─ nvidia-smi on HOST          │  │
│  │ │  └─ Gets GPU 0-7 assignments │  │
│  │ │     with HOST PIDs           │  │
│  │ │                              │  │
│  │ ├─ Read /proc/<pid>/cgroup     │  │
│  │ │  └─ Map PID → Container ID   │  │
│  │ │                              │  │
│  │ └─ Query Kubernetes API        │  │
│  │    └─ Map Container → Pod      │  │
│  └────────────────────────────────┘  │
│                                      │
│  ┌───────┐ ┌───────┐ ┌───────┐       │
│  │ Pod A │ │ Pod B │ │ Pod C │       │
│  │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ ...   │
│  └───────┘ └───────┘ └───────┘       │
└──────────────────────────────────────┘
```
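
The host-side mapping is the core of this design: nvidia-smi on the host reports host PIDs, `/proc/<pid>/cgroup` ties each PID to a container ID, and the Kubernetes API ties the container to a pod. The sketch below illustrates that chain; it assumes a containerd-based node and the official `kubernetes` Python client, and the helper names are illustrative rather than the tracker's actual API.

```python
# Sketch of the host-side mapping chain; not the tracker's actual code.
# Assumes a containerd node (64-hex container IDs in /proc/<pid>/cgroup)
# and the official `kubernetes` Python client.
import re
import subprocess
from typing import List, Optional

from kubernetes import client, config


def gpu_pids_on_host() -> List[int]:
    """Host PIDs of GPU compute processes, as seen by nvidia-smi on the host."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
        text=True,
    )
    return [int(line.strip()) for line in out.splitlines() if line.strip()]


def container_id_for_pid(pid: int) -> Optional[str]:
    """Pull a container ID out of /proc/<pid>/cgroup (containerd layout assumed)."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            cgroup = f.read()
    except FileNotFoundError:
        return None
    match = re.search(r"([0-9a-f]{64})", cgroup)
    return match.group(1) if match else None


def pod_for_container_id(container_id: str, namespace: str) -> Optional[str]:
    """Map a container ID back to its pod via the Kubernetes API."""
    config.load_incluster_config()  # uses the DaemonSet pod's ServiceAccount
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            # status.container_id looks like "containerd://<64-hex-id>"
            if status.container_id and container_id in status.container_id:
                return pod.metadata.name
    return None
```

Because the DaemonSet runs with `hostPID: true`, the PIDs nvidia-smi reports are host PIDs, which is what makes the `/proc/<pid>/cgroup` lookup possible in the first place.
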
## Files

- **`examples/gpu_process_tracker_dcgm.py`** - Per-GPU tracker library
- **`deploy/gpu-process-tracker-daemonset.yaml`** - DaemonSet deployment
- **`deploy/deploy_gpu_tracker.sh`** - Deployment script

## Quick Start

### 1. Deploy Tracker DaemonSet

```bash
cd dynamo/tests/fault_tolerance/hardware/fault-injection-service/deploy
./deploy_gpu_tracker.sh
```

This deploys a DaemonSet on all GPU nodes with `hostPID` access.
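
To sanity-check the rollout from a script rather than with `kubectl`, something like the following works (a sketch using the `kubernetes` Python client; the DaemonSet name and namespace follow the examples in this README and may differ in your deployment):

```python
# Sketch: verify the tracker DaemonSet is fully scheduled and runs with hostPID.
# "gpu-process-tracker" / "dynamo-oviya" follow this README's examples.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster
apps = client.AppsV1Api()

ds = apps.read_namespaced_daemon_set("gpu-process-tracker", "dynamo-oviya")
print(f"scheduled={ds.status.desired_number_scheduled} ready={ds.status.number_ready}")
assert ds.spec.template.spec.host_pid, "tracker pods must run with hostPID: true"
```
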
### 2. Query GPU Processes

**See all processes on a node:**
```bash
# Get pod on specific node
NODE_POD=$(kubectl get pods -n dynamo-oviya -l app=gpu-process-tracker \
  --field-selector spec.nodeName=aks-a100a-36888584-vmss000000 \
  -o name | head -1)

# Query all GPU processes
kubectl exec -n dynamo-oviya $NODE_POD -- python3 /usr/local/bin/gpu_tracker.py
```

**Output:**
```
Node: aks-a100a-36888584-vmss000000
Total GPU Processes: 8
================================================================================

GPU 0: 1 process(es)
--------------------------------------------------------------------------------
  PID 12345: python3
    Memory: 2048 MB
    UUID: GPU-abc123...
    Pod: dynamo-oviya/vllm-worker-0

GPU 1: 1 process(es)
--------------------------------------------------------------------------------
  PID 12346: python3
    Memory: 2048 MB
    UUID: GPU-def456...
    Pod: dynamo-oviya/vllm-worker-1

...
```

**See only GPU 3:**
```bash
kubectl exec -n dynamo-oviya $NODE_POD -- \
  python3 /usr/local/bin/gpu_tracker.py --gpu 3
```

**Get JSON output:**
```bash
kubectl exec -n dynamo-oviya $NODE_POD -- \
  python3 /usr/local/bin/gpu_tracker.py --format json
```
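
The JSON form is the easiest one to consume from a test harness. Below is a sketch only; the field names are assumed to mirror the `GPUProcess` fields documented later in this README, so adjust them to whatever `gpu_tracker.py` actually emits, and `NODE_POD` is an example value.

```python
# Sketch: read per-GPU process info as JSON from the tracker pod.
# NODE_POD is an example name; the field names are assumptions based on the
# GPUProcess dataclass documented below.
import json
import subprocess

NODE_POD = "pod/gpu-process-tracker-abc12"  # e.g. the value from the query above

raw = subprocess.check_output(
    [
        "kubectl", "exec", "-n", "dynamo-oviya", NODE_POD, "--",
        "python3", "/usr/local/bin/gpu_tracker.py", "--format", "json",
    ],
    text=True,
)

for proc in json.loads(raw):
    print(proc.get("gpu_index"), proc.get("pid"), proc.get("pod_name"))
```
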
### 3. Use in Your Code

**Python API (for fault injection tests):**

```python
from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu

# Get processes organized by GPU (0-7)
processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace="dynamo-oviya")

# Example: find which GPU holds expert 42
for gpu_idx, processes in processes_by_gpu.items():
    for proc in processes:
        if proc.experts and 42 in proc.experts:
            print(f"Expert 42 is on GPU {gpu_idx} in pod {proc.pod_name}")

            # Inject a fault on this specific GPU
            # (see the full fault-injection example below)
            inject_xid_on_gpu(
                node=proc.node_name,
                gpu_index=gpu_idx,
                xid_type="79",
            )
```

**Integration with existing `map_gpu_to_experts.py`:**

```python
# Replace the existing function
def get_vllm_gpu_processes(namespace: str = "dynamo-oviya") -> List[GPUProcess]:
    """Enhanced version that works on kernel 5.15.0"""
    from gpu_process_tracker_dcgm import get_vllm_gpu_processes_enhanced
    return get_vllm_gpu_processes_enhanced(namespace)
```

## Per-GPU Inventory Object

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GPUNodeInventory:
    node_name: str
    gpus: Dict[int, List["GPUProcess"]]  # GPU 0-7 -> processes

    def get_gpu_processes(self, gpu_index: int) -> List["GPUProcess"]: ...
    def total_processes(self) -> int: ...


@dataclass
class GPUProcess:
    pid: int              # Host PID
    gpu_index: int        # GPU 0-7 on this node
    gpu_uuid: str
    memory_mb: int
    process_name: str
    pod_name: str         # Kubernetes pod
    pod_namespace: str
    node_name: str
    dp_rank: int          # Data parallel rank
    experts: List[int]    # MoE expert IDs
```
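
For orientation, the same per-GPU layout can be walked directly via the helper used earlier; this mirrors the `gpus` mapping on `GPUNodeInventory` (a usage sketch, not part of the library itself):

```python
# Sketch: iterate the per-GPU process layout returned by the documented helper.
from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu

processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace="dynamo-oviya")

total = sum(len(procs) for procs in processes_by_gpu.values())
print(f"total GPU processes: {total}")

for gpu_index in range(8):  # GPUs 0-7 on each node
    for proc in processes_by_gpu.get(gpu_index, []):
        print(f"GPU {gpu_index}: pid={proc.pid} "
              f"pod={proc.pod_namespace}/{proc.pod_name} experts={proc.experts}")
```
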
## Example: Target Specific GPU for Fault Injection

```python
#!/usr/bin/env python3
"""
Example: Inject XID fault on GPU running specific expert
"""

from gpu_process_tracker_dcgm import get_vllm_gpu_processes_per_gpu
from cuda_fault_injection import inject_xid


def inject_fault_on_expert(target_expert_id: int, namespace: str = "dynamo-oviya"):
    """Inject GPU fault on the GPU running specific expert"""

    # Get per-GPU process inventory
    processes_by_gpu = get_vllm_gpu_processes_per_gpu(namespace)

    # Find target GPU
    target_gpu = None
    target_node = None
    target_pod = None

    for gpu_idx, processes in processes_by_gpu.items():
        for proc in processes:
            if proc.experts and target_expert_id in proc.experts:
                target_gpu = gpu_idx
                target_node = proc.node_name
                target_pod = proc.pod_name
                print(f"Found expert {target_expert_id}:")
                print(f"  Node: {target_node}")
                print(f"  GPU: {target_gpu}")
                print(f"  Pod: {target_pod}")
                break
        if target_gpu is not None:
            break

    if target_gpu is None:
        raise ValueError(f"Expert {target_expert_id} not found")

    # Inject XID on specific GPU
    print(f"\nInjecting XID 79 on GPU {target_gpu}...")
    inject_xid(
        node_name=target_node,
        gpu_index=target_gpu,
        xid_type="79",
        namespace=namespace,
    )

    print("Fault injected successfully")

    return target_node, target_gpu, target_pod


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: inject_fault_on_expert.py <expert_id>")
        sys.exit(1)

    expert_id = int(sys.argv[1])
    inject_fault_on_expert(expert_id)
```

## Comparison: MPU vs This Solution

| Feature | MPU | This Solution |
|---------|-----|---------------|
| Kernel module | ✅ Required | ❌ Not needed |
| Kernel 5.15 support | ❌ Uncertain | ✅ Works |
| Per-GPU tracking | ✅ Yes (0-7) | ✅ Yes (0-7) |
| hostPID required | ❌ No | ✅ Yes (DaemonSet only) |
| Works in containers | ✅ Yes | ⚠️ No (needs host access) |
| Maintenance | ⚠️ Per-kernel rebuild | ✅ No maintenance |
| Security | ✅ No privileged access | ⚠️ DaemonSet needs privileges |

## When to Use Each

**Use MPU if:**
- Kernel < 5.7.7 or tested version (4.15, 4.19, 5.14.0-404)
- Need nvidia-smi to work inside containers
- Can maintain kernel modules

**Use This Solution if:**
- Kernel 5.15+ (like your AKS cluster)
- Can run DaemonSet with hostPID
- Want zero-maintenance solution
- Already have DCGM in stack

## Troubleshooting

**Q: DaemonSet pods not starting?**
```bash
# Check node selector matches your GPU nodes
kubectl get nodes --show-labels | grep nvidia

# Update DaemonSet nodeSelector if needed
kubectl edit daemonset gpu-process-tracker -n dynamo-oviya
```

**Q: No processes showing up?**
```bash
# Verify nvidia-smi works on host
kubectl exec -n dynamo-oviya $NODE_POD -- nvidia-smi

# Check if pods are using GPUs
kubectl exec -n dynamo-oviya $NODE_POD -- \
  nvidia-smi --query-compute-apps=pid,process_name --format=csv
```

**Q: Pod metadata not enriched?**
```bash
# Verify Kubernetes API access
kubectl exec -n dynamo-oviya $NODE_POD -- \
  kubectl get pods -n dynamo-oviya

# Check ServiceAccount permissions
kubectl describe sa default -n dynamo-oviya
```
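
If the ServiceAccount looks right but enrichment still fails, a direct permission probe can help. The sketch below uses the Kubernetes `SelfSubjectAccessReview` API via the Python client; run it with the tracker pod's ServiceAccount.

```python
# Sketch: check whether the tracker's ServiceAccount may list pods in the
# namespace, using a SelfSubjectAccessReview.
from kubernetes import client, config

config.load_incluster_config()  # uses the pod's own ServiceAccount token
review = client.V1SelfSubjectAccessReview(
    spec=client.V1SelfSubjectAccessReviewSpec(
        resource_attributes=client.V1ResourceAttributes(
            namespace="dynamo-oviya", verb="list", resource="pods"
        )
    )
)
result = client.AuthorizationV1Api().create_self_subject_access_review(review)
print("can list pods:", result.status.allowed)
```
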
## Future Enhancements

- [ ] Prometheus metrics export (per-GPU utilization)
- [ ] gRPC API for real-time queries
- [ ] Expert routing map integration
- [ ] Automatic fault injection targets
- [ ] Historical tracking (time-series DB)

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
__pycache__
*.pyc
*.pyo
*.pyd
.Python
*.so
.git
.gitignore
*.md
README.md
.venv
venv
env
*.egg-info
.pytest_cache
.coverage
htmlcov
*.log
deploy/
examples/
client/

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
venv/
env/
ENV/
.venv

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# Testing
.pytest_cache/
.coverage
htmlcov/
*.log

# Docker
*.tar
*.tar.gz

# Kubernetes
*.kubeconfig
