The following steps show you how to convert a PyTorch training script to
utilize SageMaker's distributed data parallel library.

The distributed data parallel library APIs are designed to be close to PyTorch Distributed Data
Parallel (DDP) APIs.
See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__ for additional details on how to implement the data parallel library
API offered for PyTorch.
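
For orientation, the conversion is largely a matter of swapping module paths.
The following is a minimal sketch of the import-level correspondence; the
native PyTorch lines are shown only for comparison and are not taken from
this guide.

.. code:: python

    # Native PyTorch DDP imports (shown for comparison only):
    # import torch.distributed as dist
    # from torch.nn.parallel import DistributedDataParallel as DDP

    # SageMaker distributed data parallel drop-in equivalents used in this guide:
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP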

- First import the distributed data parallel library’s PyTorch client and initialize it. You also import
  the distributed data parallel library module for distributed training.

.. code:: python

    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

    dist.init_process_group()

- Pin each GPU to a single distributed data parallel library process with ``local_rank`` - this
  refers to the relative rank of the process within a given node. The
  ``smdistributed.dataparallel.torch.distributed.get_local_rank()`` API provides
  the local rank of the device. The leader node will be rank 0, and the worker
  nodes will be rank 1, 2, 3, and so on.

.. code:: python

    torch.cuda.set_device(dist.get_local_rank())
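
For example, each process can print how it was placed. This is an illustrative
sketch, not part of the original script, and it assumes ``dist.get_rank()`` is
available alongside the calls already used in this guide.

.. code:: python

    # Illustrative sketch (assumption): report this process's placement.
    # `dist` is smdistributed.dataparallel.torch.distributed, imported as above.
    print(
        f"global rank {dist.get_rank()} of {dist.get_world_size()}, "
        f"local rank {dist.get_local_rank()}"
    )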

- Then wrap the PyTorch model with the distributed data parallel library’s DDP.

.. code:: python

    model = ...

    # Wrap model with SageMaker's DistributedDataParallel
    model = DDP(model)
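
Because every process now holds a replica of the model, a common follow-on
pattern is to write checkpoints from a single process only. This is a sketch of
general DDP practice, not text from this guide, and it assumes ``dist.get_rank()``
is available and that checkpoints are written to the usual SageMaker model
directory.

.. code:: python

    # Sketch of a common DDP checkpointing pattern (assumption, not from this guide):
    # save from the leader process only, so replicas do not write the same file.
    if dist.get_rank() == 0:
        torch.save(model.state_dict(), "/opt/ml/model/checkpoint.pt")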

All put together, the following is an example PyTorch training script
you will have for distributed training with the distributed data parallel library:

.. code:: python

    # Import distributed data parallel library PyTorch API
    import smdistributed.dataparallel.torch.distributed as dist

    # Import distributed data parallel library PyTorch DDP
    from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

    # Initialize distributed data parallel library
    dist.init_process_group()

    class Net(nn.Module):
        ...

    def main():
        # Scale batch size by world size
        batch_size //= dist.get_world_size() // 8
        batch_size = max(batch_size, 1)

        # Prepare dataset
        train_dataset = torchvision.datasets.MNIST(...)

        # Set num_replicas and rank in DistributedSampler
        train_sampler = torch.utils.data.distributed.DistributedSampler(
            train_dataset,
            num_replicas=dist.get_world_size(),
            rank=dist.get_rank())
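
With the script converted, the training job is launched through the SageMaker
Python SDK by enabling the ``dataparallel`` distribution option on the
``PyTorch`` estimator. The following is a minimal sketch; the entry point,
role, framework and Python versions, and instance settings are placeholder
assumptions, not values taken from this guide.

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Placeholder values below (entry_point, role, versions, instances) are
    # assumptions for illustration only.
    estimator = PyTorch(
        entry_point="train.py",
        role="<your-iam-role-arn>",
        framework_version="1.8.1",
        py_version="py36",
        instance_count=2,
        instance_type="ml.p3.16xlarge",
        # Enable SageMaker's distributed data parallel library
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit()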

The following steps show you how to convert a TensorFlow 2.x training
script to utilize the distributed data parallel library.

The distributed data parallel library APIs are designed to be close to Horovod APIs.
See `SageMaker distributed data parallel TensorFlow examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__ for additional details on how to implement the data parallel library
API offered for TensorFlow.
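
As an orientation sketch, the conversion largely amounts to swapping the
Horovod import for the library's module. The Horovod line below is shown only
for comparison and is an assumption about a typical Horovod script, not text
from this guide.

.. code:: python

    # Typical Horovod import the library is modeled after (for comparison only):
    # import horovod.tensorflow as hvd

    # SageMaker distributed data parallel equivalent used throughout this guide:
    import smdistributed.dataparallel.tensorflow as sdp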

- First import the distributed data parallel library’s TensorFlow client and initialize it:

.. code:: python

    import smdistributed.dataparallel.tensorflow as sdp

    sdp.init()

- Scale the learning rate by the total number of GPU workers, ``sdp.size()``:

.. code:: python

    learning_rate = learning_rate * sdp.size()

- Use the library’s ``DistributedGradientTape`` to optimize AllReduce
  operations during training. This wraps ``tf.GradientTape``.

.. code:: python

    with tf.GradientTape() as tape:
        output = model(input)
        loss_value = loss(label, output)

    # Wrap tf.GradientTape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)
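
The wrapped tape is then used like a regular ``tf.GradientTape`` to compute and
apply gradients. A minimal sketch, assuming an optimizer ``opt`` has been
created elsewhere in the script:

.. code:: python

    # Compute gradients with the wrapped tape and apply them with the
    # optimizer `opt` (assumed to be defined elsewhere in the script).
    grads = tape.gradient(loss_value, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))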

All put together, the following is an example TensorFlow 2 training
script you will have for distributed training with the library.

.. code:: python

    import tensorflow as tf

    # Import the library's TF API
    import smdistributed.dataparallel.tensorflow as sdp