Add accelerator API to RPC distributed examples: ddp_rpc, parameter_server, rnn (#1371)
* Add rpc/ddp_rpc and rpc/rnn examples to CI
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
* Add accelerator API to RPC distributed examples:
- ddp_rpc
- parameter_server
- rnn
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
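For context, a minimal sketch of what device selection with the accelerator API can look like in these examples, assuming the `torch.accelerator` module available in recent PyTorch releases; the helper name `get_device` is illustrative and not necessarily the exact code in this PR:

```python
import torch

# Illustrative per-rank device selection using the torch.accelerator API
# (recent PyTorch releases); falls back to CPU when no accelerator is present.
def get_device(rank: int) -> torch.device:
    if torch.accelerator.is_available():
        acc = torch.accelerator.current_accelerator()  # e.g. cuda or xpu
        return torch.device(acc.type, rank % torch.accelerator.device_count())
    return torch.device("cpu")

print(get_device(0))
```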
* Update requirements for RPC examples to include numpy
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
* Enhance GPU verification and cleanup in DDP RPC example
- Added a function to verify minimum GPU count before execution.
- Updated HybridModel initialization to use rank instead of device.
- Ensured proper cleanup of the process group to avoid resource leaks.
- Added exit message if insufficient GPUs are detected.
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
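A hedged sketch of the GPU check and process-group cleanup described in this commit; `verify_min_gpus`, `MIN_GPUS`, and `cleanup` are illustrative names and may not match the example's actual code:

```python
import sys

import torch
import torch.distributed as dist

MIN_GPUS = 2  # the DDP + RPC example needs one GPU per trainer


def verify_min_gpus(required: int = MIN_GPUS) -> bool:
    """Return True if enough GPUs are visible to run the example."""
    available = torch.cuda.device_count()
    if available < required:
        print(f"Exiting: found {available} GPU(s), need at least {required}.")
        return False
    return True


def cleanup():
    # Tear down the process group after training to avoid resource leaks.
    if dist.is_initialized():
        dist.destroy_process_group()


if __name__ == "__main__":
    if not verify_min_gpus():
        sys.exit(0)
    # ... launch the workers here, then clean up in each trainer process.
```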
* Update torch version in requirements.txt
- Remove CPU execution option since DDP requires 2 GPUs for this example.
- Refine README.md for DDP RPC example clarity and detail
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
---------
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
distributed/rpc/ddp_rpc/README.md (4 additions, 12 deletions)
@@ -1,18 +1,10 @@
 Distributed DataParallel + Distributed RPC Framework Example

-The example shows how to combine Distributed DataParallel with the Distributed
-RPC Framework. There are two trainer nodes, 1 master node and 1 parameter
-server in the example.
+This example demonstrates how to combine Distributed DataParallel (DDP) with the Distributed RPC Framework. It requires two trainer nodes (each with a GPU), one master node, and one parameter server.

-The master node creates an embedding table on the parameter server and drives
-the training loop on the trainers. The model consists of a dense part
-(nn.Linear) replicated on the trainers via Distributed DataParallel and a
-sparse part (nn.EmbeddingBag) which resides on the parameter server. Each
-trainer performs an embedding lookup on the parameter server (using the
-Distributed RPC Framework) and then executes its local nn.Linear module.
-During the backward pass, the gradients for the dense part are aggregated via
-allreduce by DDP and the distributed backward pass updates the parameters for
-the embedding table on the parameter server.
+The master node initializes an embedding table on the parameter server and orchestrates the training loop across the trainers. The model is composed of a dense component (`nn.Linear`), which is replicated on the trainers using DDP, and a sparse component (`nn.EmbeddingBag`), which resides on the parameter server.
+
+Each trainer performs embedding lookups on the parameter server via RPC, then processes the results through its local `nn.Linear` module. During the backward pass, DDP aggregates gradients for the dense part using allreduce, while the distributed backward pass updates the embedding table parameters on the parameter server.
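To make the README's description concrete, here is a condensed, illustrative sketch of the hybrid model and training step it refers to; the class name, layer dimensions, loss, and exact RPC calls are simplified assumptions rather than the example's actual `main.py`:

```python
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.nn as nn
from torch.distributed.optim import DistributedOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP


class HybridModel(nn.Module):
    """Dense nn.Linear wrapped in DDP on each trainer; the sparse
    nn.EmbeddingBag lives on the parameter server behind an RRef."""

    def __init__(self, emb_rref: rpc.RRef, rank: int):
        super().__init__()
        self.emb_rref = emb_rref                  # remote EmbeddingBag handle
        self.device = torch.device("cuda", rank)  # this trainer's GPU
        self.fc = DDP(nn.Linear(16, 8).to(self.device), device_ids=[rank])

    def forward(self, indices, offsets):
        # The embedding lookup executes on the parameter server via RPC;
        # the result is then fed through the local, DDP-replicated fc layer.
        emb = self.emb_rref.rpc_sync().forward(indices, offsets)
        return self.fc(emb.to(self.device))


def train_step(model, opt: DistributedOptimizer, indices, offsets, target):
    # A distributed autograd context records the cross-worker graph: the
    # backward pass propagates gradients to the embedding table on the
    # parameter server, while DDP allreduces gradients of the local fc layer.
    with dist_autograd.context() as ctx_id:
        loss = nn.functional.mse_loss(model(indices, offsets), target)
        dist_autograd.backward(ctx_id, [loss])
        opt.step(ctx_id)  # updates local fc params and the remote embedding table
```

In this arrangement the `DistributedOptimizer` is built from RRefs to both the local dense parameters and the remote embedding parameters, which is what lets a single `step(ctx_id)` update parameters on the trainer and the parameter server.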