Commit 198406f

Recommend disabling eager messages to avoid NCCL watchdog timeouts. (#293)
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
1 parent f308078 commit 198406f

1 file changed: +10 -0 lines changed

docs/software/communication/nccl.md

Lines changed: 10 additions & 0 deletions
@@ -22,6 +22,16 @@ While the container engine sets these automatically when using the NCCL hook, th
 
 [_Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms_](https://arxiv.org/abs/2507.04786v2) contains detailed information about NCCL algorithms and protocols, which can be helpful for deciding if your application could benefit from an alternative configuration.
 
+!!! warning "NCCL watchdog timeout or hanging process"
+    In some cases, still under investigation, NCCL may hang resulting in a stuck process or a watchdog timeout error.
+    In this scenario, we recommend disabling Slingshot eager messages with the following workaround:
+    ```bash
+    # Disable eager messages to avoid NCCL timeouts
+    export FI_CXI_RDZV_GET_MIN=0
+    export FI_CXI_RDZV_THRESHOLD=0
+    export FI_CXI_RDZV_EAGER_SIZE=0
+    ```
+
 !!! warning "Using NCCL with uenvs"
     The environment variables listed above are not set automatically when using uenvs.
 
