docs/software/communication/nccl.md
10 additions & 0 deletions
@@ -22,6 +22,16 @@ While the container engine sets these automatically when using the NCCL hook, th
[_Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms_](https://arxiv.org/abs/2507.04786v2) contains detailed information about NCCL algorithms and protocols, which can be helpful for deciding if your application could benefit from an alternative configuration.
+!!! warning "NCCL watchdog timeout or hanging process"
+    In some cases that are still under investigation, NCCL may hang, resulting in a stuck process or a watchdog timeout error.
+    In this scenario, we recommend disabling Slingshot eager messages with the following workaround:
+    ```bash
+    # Disable eager messages to avoid NCCL timeouts
+    export FI_CXI_RDZV_GET_MIN=0
+    export FI_CXI_RDZV_THRESHOLD=0
+    export FI_CXI_RDZV_EAGER_SIZE=0
+    ```
+
!!! warning "Using NCCL with uenvs"
    The environment variables listed above are not set automatically when using uenvs.
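
As a rough sketch of how the workaround in this change might be applied in a job script, the three `FI_CXI_RDZV_*` exports can be wrapped in a small helper and echoed into the job log, so that it is easy to confirm afterwards which settings were active. The helper name `disable_cxi_eager_messages` is hypothetical and not part of the documentation change; only the variable names and values come from the diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not part of the change itself):
# applies the three settings recommended in the warning block.
disable_cxi_eager_messages() {
    export FI_CXI_RDZV_GET_MIN=0
    export FI_CXI_RDZV_THRESHOLD=0
    export FI_CXI_RDZV_EAGER_SIZE=0
}

disable_cxi_eager_messages

# Record the values in the job output so a later hang can be correlated
# with the configuration that was actually in effect.
for var in FI_CXI_RDZV_GET_MIN FI_CXI_RDZV_THRESHOLD FI_CXI_RDZV_EAGER_SIZE; do
    printf '%s=%s\n' "$var" "$(printenv "$var")"
done

# The application would be launched after this point; the launcher command
# is site-specific and intentionally not shown here.
```

Because the variables are exported before the application starts, any process launched from the same shell inherits them.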