Commit 1a58734

small changes

1 parent aa27a80 commit 1a58734

intermediate_source/monarch_titan_distributed_tutorial.rst renamed to intermediate_source/monarch_distributed_tutorial.rst

Lines changed: 15 additions & 10 deletions
@@ -68,6 +68,7 @@ The JobTrait design pattern allows for interfacing with custom schedulers, such
 def create_slurm_job(
     mesh_name: str,
     num_nodes: int,
+    gpus_per_node: int,
     time_limit: str = "06:00:00"
 ) -> SlurmJob:
     """
@@ -76,6 +77,7 @@ The JobTrait design pattern allows for interfacing with custom schedulers, such
     A JobTrait can consist of multiple meshes, and
     Monarch allows for re-attaching to ongoing jobs.
     num_nodes: Number of nodes allocated per mesh
+    gpus_per_node: Number of GPUs per node in the mesh

     Note: SlurmJob is just one instance of a Monarch scheduler interface.
     Consult the JobTrait documentation to find one that's right for your usecase.
@@ -85,6 +87,7 @@ The JobTrait design pattern allows for interfacing with custom schedulers, such
         meshes={mesh_name: num_nodes},
         job_name=default_job_name,
         time_limit=time_limit,
+        gpus_per_nodes=gpus_per_node,
         # ... additional args can be passed here
     )
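
With the new argument in place, callers now pass the GPU count per node explicitly. A minimal usage sketch of the updated helper (the concrete values and mesh name below are assumptions for illustration, not part of this commit):

.. code-block:: python

   # Hypothetical call site for the updated helper; only the gpus_per_node
   # argument comes from this commit, the literal values are made up.
   slurm_job = create_slurm_job(
       mesh_name="mesh0",
       num_nodes=4,
       gpus_per_node=8,
       time_limit="02:00:00",
   )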
@@ -202,23 +205,26 @@ but here are some common usages:
     try:
         # where mesh0 is 4 nodes * 8 GPUs
         proc_mesh = mesh0.spawn_procs({"gpus": 32})
-        trainer_actor = proc_mesh.spawn(...)
+        trainer_actors = proc_mesh.spawn(...)

         # Call on all ranks
-        await trainer_actor.ping_rank.call()
+        await trainer_actors.ping_rank.call()

         # Call-and-forget on all ranks
-        trainer_actor.ping_rank.broadcast()
+        trainer_actors.ping_rank.broadcast()

         # Call on ONE random rank
-        await trainer_actor.ping_rank.choose()
+        await trainer_actors.ping_rank.choose()
+
+        # Call on the first 3 ranks of node 0
+        await trainer_actors.slice(hosts=0, gpus=slice(0, 3)).ping_rank.call()

     except Exception as e:
         # handle SupervisionEvents from remote actor failures
         pass

 Remote actor endpoints can also utilize Python native breakpoints, enabling interactive debugging sessions.
-For a complete deep-dive into Monarch debuggers, `refer to the documentation <https://meta-pytorch.org/monarch/generated/examples/debugging.html>`_.
+For a complete deep-dive into Monarch debuggers, please `refer to the documentation <https://meta-pytorch.org/monarch/generated/examples/debugging.html>`_.

 .. code-block:: python
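
For readers following the renamed ``trainer_actors``/``ping_rank`` examples above, a minimal sketch of an actor that could back those endpoint calls (the import path, ``current_rank`` helper, and ``Trainer`` class are assumptions for illustration, not part of this commit):

.. code-block:: python

   # Sketch only: assumes monarch.actor exposes Actor, endpoint, and current_rank;
   # the Trainer class and its behavior are illustrative, not from this commit.
   from monarch.actor import Actor, current_rank, endpoint

   class Trainer(Actor):
       @endpoint
       async def ping_rank(self) -> int:
           # Report which rank answered the call.
           rank = current_rank().rank
           print(f"pong from rank {rank}")
           return rank

   # Hypothetical wiring against the mesh from the example above:
   # trainer_actors = proc_mesh.spawn("trainer", Trainer)
   # await trainer_actors.ping_rank.call()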
@@ -266,7 +272,7 @@ and some of the training hyperparameters.
     gpus_per_node: int = 8

 TorchTitan uses a JobConfig object to control all aspects of training.
-Here we create a function that builds this configuration from our RunParams.
+Here we create a function that parses this configuration from our RunParams.

 .. code-block:: python
@@ -338,14 +344,13 @@ This is where Monarch's power becomes most apparent.
     try:
         # 1. Create a SLURM job with N nodes
         # This leverages Monarch to reserve a persistent machine allocation
-        slurm_job = create_slurm_job(mesh_name, RunParams.num_nodes)
+        slurm_job = create_slurm_job(mesh_name, RunParams.num_nodes, RunParams.gpus_per_node)
         job_state = slurm_job.state()

         # 2. Create a process mesh on the machine allocation
         # This creates one process per GPU across all allocated nodes
         logger.info("Creating process mesh...")
-        total_gpus = RunParams.gpus_per_node * RunParams.num_nodes
-        proc_mesh = job_state.mesh0.spawn_procs({"gpus": total_gpus})
+        proc_mesh = job_state.mesh0.spawn_procs({"gpus": RunParams.gpus_per_node})

         # 3. Configure remote logging behavior
         # - stream_to_client: Forward all remote logs to your local console
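
The step-3 comment above refers to a ``stream_to_client`` option; a minimal sketch of how the freshly spawned mesh might enable it (the ``logging_option`` method name and its signature are assumptions about Monarch's logging API, not part of this commit):

.. code-block:: python

   # Sketch only: the logging_option name is an assumption; stream_to_client
   # is taken from the comment in the diff above.
   await proc_mesh.logging_option(
       stream_to_client=True,  # forward remote logs to the local console
   )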
@@ -435,7 +440,7 @@ Finally, we tie everything together in a main function that kicks off the workfl
 Conclusion
 -----------

-Congrats! In this tutorial, you learned how to combine Monarch's actor framework with
+Congrats! In this tutorial, you learned how to apply Monarch's actor framework with
 TorchTitan for scalable distributed training.

 **Further Reading**
