intermediate_source/monarch_titan_distributed_tutorial.rst
15 additions, 10 deletions
@@ -68,6 +68,7 @@ The JobTrait design pattern allows for interfacing with custom schedulers, such
 def create_slurm_job(
     mesh_name: str,
     num_nodes: int,
+    gpus_per_node: int,
     time_limit: str = "06:00:00"
 ) -> SlurmJob:
     """
@@ -76,6 +77,7 @@ The JobTrait design pattern allows for interfacing with custom schedulers, such
     A JobTrait can consist of multiple meshes, and
     Monarch allows for re-attaching to ongoing jobs.
         num_nodes: Number of nodes allocated per mesh
+        gpus_per_node: Number of GPUs per node in the mesh

     Note: SlurmJob is just one instance of a Monarch scheduler interface.
     Consult the JobTrait documentation to find one that's right for your usecase.
@@ -85,6 +87,7 @@ The JobTrait design pattern allows for interfacing with custom schedulers, such
         meshes={mesh_name: num_nodes},
         job_name=default_job_name,
         time_limit=time_limit,
+        gpus_per_nodes=gpus_per_node,
         # ... additional args can be passed here
     )
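For reference, a minimal usage sketch of the helper this change extends could look like the following; the mesh name, node count, and GPU count are illustrative values, not taken from the tutorial:

    # Illustrative values only -- reserve one 4-node mesh with 8 GPUs per node
    # using the create_slurm_job helper shown in the hunks above.
    job = create_slurm_job(
        mesh_name="mesh0",
        num_nodes=4,
        gpus_per_node=8,
        time_limit="06:00:00",
    )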
@@ -202,23 +205,26 @@ but here are some common usages:
     try:
         # where mesh0 is 4 nodes * 8 GPUs
         proc_mesh = mesh0.spawn_procs({"gpus": 32})
-        trainer_actor = proc_mesh.spawn(...)
+        trainer_actors = proc_mesh.spawn(...)

         # Call on all ranks
-        await trainer_actor.ping_rank.call()
+        await trainer_actors.ping_rank.call()

         # Call-and-forget on all ranks
-        trainer_actor.ping_rank.broadcast()
+        trainer_actors.ping_rank.broadcast()

         # Call on ONE random rank
-        await trainer_actor.ping_rank.choose()
+        await trainer_actors.ping_rank.choose()
+
+        # Call on the first 3 ranks of node 0
+

     except Exception as e:
         # handle SupervisionEvents from remote actor failures
         pass

 Remote actor endpoints can also utilize Python native breakpoints, enabling interactive debugging sessions.
-For a complete deep-dive into Monarch debuggers, `refer to the documentation <https://meta-pytorch.org/monarch/generated/examples/debugging.html>`_.
+For a complete deep-dive into Monarch debuggers, please `refer to the documentation <https://meta-pytorch.org/monarch/generated/examples/debugging.html>`_.

 .. code-block:: python
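The snippet above calls a ping_rank endpoint on the spawned actors. Below is a hedged sketch of how such an actor could be defined, assuming Monarch's Actor base class, endpoint decorator, and current_rank helper are importable from monarch.actor and that proc_mesh.spawn takes a name plus an actor class; verify both against the Monarch documentation before relying on them:

    # Hedged sketch of an actor the snippet above could be calling.
    # Assumptions: Actor, endpoint, and current_rank come from monarch.actor.
    from monarch.actor import Actor, current_rank, endpoint


    class TrainerActor(Actor):
        @endpoint
        async def ping_rank(self) -> int:
            # breakpoint()  # native Python breakpoints work inside endpoints,
            #               # per the debugging docs linked above
            rank = current_rank().rank
            print(f"pong from rank {rank}")
            return rank


    # Assumed spawn call shape (name, actor class):
    # trainer_actors = proc_mesh.spawn("trainer", TrainerActor)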
@@ -266,7 +272,7 @@ and some of the training hyperparameters.
     gpus_per_node: int = 8

 TorchTitan uses a JobConfig object to control all aspects of training.
-Here we create a function that builds this configuration from our RunParams.
+Here we create a function that parses this configuration from our RunParams.

 .. code-block:: python
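Only gpus_per_node is visible in this hunk, so here is a hedged sketch of how a RunParams dataclass might carry it alongside a derived world size; the other field and the property are placeholders, and the tutorial's actual hand-off into TorchTitan's JobConfig may look different:

    from dataclasses import dataclass


    @dataclass
    class RunParams:
        # Placeholder fields; only gpus_per_node appears in the diff above.
        num_nodes: int = 4
        gpus_per_node: int = 8

        @property
        def world_size(self) -> int:
            # Total ranks the TorchTitan JobConfig would be configured for.
            return self.num_nodes * self.gpus_per_node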
@@ -338,14 +344,13 @@ This is where Monarch's power becomes most apparent.
     try:
         # 1. Create a SLURM job with N nodes
         # This leverages Monarch to reserve a persistent machine allocation