Commit 84483a2

[None][doc] update docs for EPLB (#9166)
Signed-off-by: Dongxu Yang <78518666+dongxuy04@users.noreply.github.com>
1 parent 25bd2e6 commit 84483a2

1 file changed: +13 -0 lines changed

examples/wide_ep/README.md

Lines changed: 13 additions & 0 deletions
@@ -31,6 +31,19 @@ For GB200 NVL72, to make sure that Multi-Node NVLink (MNNVL) is correctly setup,
For more information on NVIDIA IMEX service for NVLink networks, refer to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.
#### Coherent Driver-Based Memory Management (CDMM)

Starting with the R580 driver, [Coherent Driver-Based Memory Management (CDMM)](https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html#hardware-software-support) is available on GB200 platforms. With CDMM, the driver manages GPU memory instead of the OS: the OS does not online the GPU memory, and the GPU memory is not exposed to the OS as a NUMA node. In Wide-EP, online EPLB requires host threads to access GPU memory in order to update the weights.

When CDMM mode is off, GPU memory is exposed as NUMA nodes, so no additional prerequisites are required.

When CDMM mode is on, GPU memory is not exposed as NUMA nodes. In that case, if online EPLB is needed, [GDRCopy](https://github.com/NVIDIA/gdrcopy?tab=readme-ov-file#build-and-installation) must be installed.
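
One rough way to see which of the two cases applies is to list the NUMA nodes the OS knows about. The sketch below reads the standard Linux sysfs entries under `/sys/devices/system/node`. The function name `list_numa_nodes` is made up for this example, and which node IDs (if any) correspond to GPU memory depends on the system, so treat the output as a hint rather than a definitive CDMM check.

```shell
# List NUMA nodes and their total memory as seen by the OS.
# The node root is a parameter only so the function can be pointed
# at a test directory; the default is the standard sysfs location.
list_numa_nodes() {
    local node_root="${1:-/sys/devices/system/node}"
    local node_dir
    for node_dir in "$node_root"/node[0-9]*; do
        [ -d "$node_dir" ] || continue
        [ -r "$node_dir/meminfo" ] || continue
        # meminfo lines look like: "Node 0 MemTotal:  123456 kB"
        awk '/MemTotal/ {printf "node%s %s kB\n", $2, $4}' "$node_dir/meminfo"
    done
}

list_numa_nodes
```

If only the CPU nodes show up, GPU memory is probably not exposed as NUMA nodes, and the GDRCopy prerequisite described here likely applies. `numactl -H` gives a similar view where it is installed.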

When GDRCopy is installed and the kernel module is loaded, the device file `/dev/gdrdrv` should exist and the kernel module `gdrdrv` should appear in the output of `lsmod`. The device file must be mapped into the container.
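
The two checks can be scripted as follows. This is a minimal sketch: `check_gdrdrv` is a hypothetical helper name, and it reads `/proc/modules` (the file that `lsmod` formats) instead of calling `lsmod` itself so that both paths can be overridden for testing.

```shell
# Report whether the GDRCopy kernel module and device file are present.
# Both paths are parameters only so the check can be run against test files.
check_gdrdrv() {
    local dev="${1:-/dev/gdrdrv}"
    local modules="${2:-/proc/modules}"
    local status=0
    # Each line of /proc/modules starts with the module name and a space.
    if grep -q '^gdrdrv ' "$modules" 2>/dev/null; then
        echo "module gdrdrv: loaded"
    else
        echo "module gdrdrv: not loaded"
        status=1
    fi
    if [ -e "$dev" ]; then
        echo "device $dev: present"
    else
        echo "device $dev: missing"
        status=1
    fi
    return "$status"
}

check_gdrdrv || echo "GDRCopy does not look ready on this host"
```

Running the same function inside the container is a quick way to confirm that the device mapping took effect.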

* For Docker, add a device mapping such as `--device=/dev/gdrdrv:/dev/gdrdrv`.
* For Slurm with Enroot, add `--container-mounts="/dev/gdrdrv:/dev/gdrdrv"` when starting containers, and set the environment variable with `export ENROOT_ALLOW_DEV=yes`.
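
Put together, the two launch variants might look like the dry-run sketch below, which only prints the commands. The image name is a placeholder, `--rm` and `--gpus all` are ordinary Docker options added for illustration, and `--container-image` assumes that Slurm uses the pyxis plugin (the one that provides `--container-mounts`); adapt all of these to your cluster.

```shell
# Dry run: print the launch commands instead of executing them.
# IMAGE is a placeholder, not a real image name.
IMAGE="${IMAGE:-your-image:tag}"
GDR_DEV=/dev/gdrdrv

# Docker: map the GDRCopy device file into the container.
docker_cmd="docker run --rm --gpus all --device=${GDR_DEV}:${GDR_DEV} ${IMAGE}"

# Slurm with Enroot (pyxis flags assumed): mount the device file.
srun_cmd="srun --container-image=${IMAGE} --container-mounts=${GDR_DEV}:${GDR_DEV}"

echo "$docker_cmd"
# ENROOT_ALLOW_DEV must be set in the environment that launches the job.
echo "export ENROOT_ALLOW_DEV=yes"
echo "$srun_cmd"
```

Append the command you want to run inside the container to either line before using it.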
### Configurations
An example yaml file to enable wide EP:
