Merge: [SE3Transformer/DGLPyT] Update container and fix benchmarking

nv-kkudrynski · nv-kkudrynski · commit acecffe16f03 · 2023-02-24T05:28:37.000-08:00
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/Dockerfile b/DGLPyTorch/DrugDiscovery/SE3Transformer/Dockerfile
@@ -24,7 +24,7 @@
 # run docker daemon with --default-runtime=nvidia for GPU detection during build
 # multistage build for DGL with CUDA and FP16
 
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.01-py3
 
 FROM ${FROM_IMAGE_NAME} AS dgl_builder
 
@@ -33,7 +33,7 @@ RUN apt-get update \
     && apt-get install -y git build-essential python3-dev make cmake \
     && rm -rf /var/lib/apt/lists/*
 WORKDIR /dgl
-RUN git clone --branch 0.9.0 --recurse-submodules --depth 1 https://github.com/dmlc/dgl.git .
+RUN git clone --branch 1.0.0 --recurse-submodules --depth 1 https://github.com/dmlc/dgl.git .
 WORKDIR build
 RUN export NCCL_ROOT=/usr \
     && cmake .. -GNinja -DCMAKE_BUILD_TYPE=Release \
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/README.md b/DGLPyTorch/DrugDiscovery/SE3Transformer/README.md
@@ -252,9 +252,9 @@ The following section lists the requirements that you need to meet in order to s
 
 ### Requirements
 
-This repository contains a Dockerfile which extends the PyTorch 21.07 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+This repository contains a Dockerfile which extends the PyTorch 23.01 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 - [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- PyTorch 21.07+ NGC container
+- PyTorch 23.01+ NGC container
 - Supported GPUs:
     - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
     - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/)
@@ -290,7 +290,7 @@ To train your model using mixed or TF32 precision with Tensor Cores or FP32, per
 
 4. Start training.
    ```
-   bash scripts/train.sh
+   bash scripts/train.sh  # or scripts/train_multi_gpu.sh
    ```
 
 5. Start inference/predictions.
@@ -474,7 +474,7 @@ The following sections provide details on how we achieved our performance and ac
 
 ##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
 
-Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the `scripts/train.sh` and `scripts/train_multi_gpu.sh` training scripts in the PyTorch 23.01 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
 
 | GPUs | Batch size / GPU | Absolute error - TF32 | Absolute error - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (mixed precision to TF32) |       
 |:----:|:----------------:|:---------------------:|:--------------------------------:|:--------------------:|:-------------------------------:|:-----------------------------------------------:|
@@ -484,7 +484,7 @@ Our results were obtained by running the `scripts/train.sh` training script in t
 
 ##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
-Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+Our results were obtained by running the `scripts/train.sh` and `scripts/train_multi_gpu.sh` training scripts in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.
 
 | GPUs | Batch size / GPU | Absolute error - FP32 | Absolute error - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (mixed precision to FP32) |      
 |:----:|:----------------:|:---------------------:|:--------------------------------:|:--------------------:|:-------------------------------:|:-----------------------------------------------:|
@@ -497,29 +497,29 @@ Our results were obtained by running the `scripts/train.sh` training script in t
 
 ##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
 
-Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five  entire training epochs after a warmup epoch.
+Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 23.01 NGC container on NVIDIA DGX A100 with 8x A100 80GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
 
 |       GPUs       |  Batch size / GPU   | Throughput - TF32 [mol/ms] | Throughput - mixed precision [mol/ms] | Throughput speedup (mixed precision - TF32) | Weak scaling - TF32 | Weak scaling - mixed precision |
 |:----------------:|:-------------------:|:--------------------------:|:-------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
-|        1         |         240         |            2.61            |                 3.35                  |                    1.28x                    |                     |                                |
-|        1         |         120         |            1.94            |                 2.07                  |                    1.07x                    |                     |                                |
-|        8         |         240         |           18.80            |                 23.90                 |                    1.27x                    |        7.20         |              7.13              |
-|        8         |         120         |           14.10            |                 14.52                 |                    1.03x                    |        7.27         |              7.01              |
+|        1         |         240         |            2.59            |                 3.23                  |                    1.25x                    |                     |                                |
+|        1         |         120         |            1.89            |                 1.89                  |                    1.00x                    |                     |                                |
+|        8         |         240         |           18.38            |                 21.42                 |                    1.17x                    |        7.09         |              6.63              |
+|        8         |         120         |           13.23            |                 13.23                 |                    1.00x                    |        7.00         |              7.00              |
 
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 
 ##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
 
-Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five  entire training epochs after a warmup epoch.
+Our results were obtained by running the `scripts/benchmark_train.sh` and `scripts/benchmark_train_multi_gpu.sh` benchmarking scripts in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs. Performance numbers (in molecules per millisecond) were averaged over five entire training epochs after a warmup epoch.
 
 |       GPUs       |   Batch size / GPU   | Throughput - FP32 [mol/ms] | Throughput - mixed precision  [mol/ms] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |
 |:----------------:|:--------------------:|:--------------------------:|:--------------------------------------:|:-------------------------------------------:|:-------------------:|:------------------------------:|
-|        1         |         240          |            1.33            |                  2.12                  |                    1.59x                    |                     |                                |
-|        1         |         120          |            1.11            |                  1.45                  |                    1.31x                    |                     |                                |
-|        8         |         240          |            9.32            |                 13.40                  |                    1.44x                    |        7.01         |              6.32              |
-|        8         |         120          |            6.90            |                  8.39                  |                    1.22x                    |        6.21         |              5.79              |
+|        1         |         240          |            1.23            |                  1.91                  |                    1.55x                    |                     |                                |
+|        1         |         120          |            1.01            |                  1.23                  |                    1.22x                    |                     |                                |
+|        8         |         240          |            8.44            |                 11.28                  |                    1.34x                    |         6.8         |              5.90              |
+|        8         |         120          |            6.06            |                  7.36                  |                    1.21x                    |        6.00         |              5.98              |
 
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -530,47 +530,47 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
 
 ##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
 
-Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.
+Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 23.01 NGC container on NVIDIA DGX A100 with 1x A100 80GB GPU.
 
 AMP
 
 | Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-|    1600    |          13.54          |      121.44      |      118.07      |      119.00      |      366.64      |
-|    800     |          12.63          |      64.11       |      63.78       |      64.37       |      68.19       |
-|    400     |          10.65          |      37.97       |      39.02       |      39.67       |      42.87       |
+|    1600    |          9.71           |      175.2       |      190.2       |      191.8       |      432.4       |
+|    800     |          7.90           |      114.5       |      134.3       |      135.8       |      140.2       |
+|    400     |          7.18           |      75.49       |      108.6       |      109.6       |      113.2       |
 
 TF32
 
 | Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-|    1600    |          8.97           |      180.85      |      178.31      |      178.92      |      375.33      |
-|    800     |          8.86           |      90.76       |      90.77       |      91.11       |      92.96       |
-|    400     |          8.49           |      47.42       |      47.65       |      48.15       |      50.74       |
+|    1600    |          8.19           |      198.2       |      206.8       |      208.5       |      377.0       |
+|    800     |          7.56           |      107.5       |      119.6       |      120.5       |      125.7       |
+|    400     |          6.97           |       59.8       |       75.1       |       75.7       |       81.3       |
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
-Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 21.07 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.
+Our results were obtained by running the `scripts/benchmark_inference.sh` inferencing benchmarking script in the PyTorch 23.01 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU.
 
 AMP
 
 | Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-|    1600    |          6.59           |      248.02      |      242.11      |      242.62      |      674.60      |
-|    800     |          6.38           |      126.49      |      125.96      |      126.31      |      127.72      |
-|    400     |          5.90           |      68.24       |      68.53       |      69.02       |      70.87       |
+|    1600    |          5.39           |      306.6       |      321.2       |      324.9       |      819.1       |
+|    800     |          4.67           |      179.8       |      201.5       |      203.8       |      213.3       |
+|    400     |          4.25           |      108.2       |      142.0       |      143.0       |      149.0       |
 
 FP32
 
 | Batch size | Throughput Avg [mol/ms] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:-----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-|    1600    |          3.33           |      482.20      |      483.50      |      485.28      |      754.84      |
-|    800     |          3.35           |      239.09      |      242.21      |      243.13      |      244.91      |
-|    400     |          3.27           |      122.68      |      123.60      |      124.18      |      125.85      |
+|    1600    |          3.14           |      510.9       |      518.83      |      521.1       |      808.0       |
+|    800     |          3.10           |      258.7       |      269.4       |      271.1       |      278.9       |
+|    400     |          2.93           |      137.3       |      147.5       |      148.8       |      151.7       |
 
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -580,6 +580,10 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
 
 ### Changelog
 
+February 2023:
+- Upgraded base container
+- Fixed benchmarking code
+
 August 2022:
 - Slight performance improvements
 - Upgraded base container
@@ -604,3 +608,4 @@ August 2021
 ### Known issues
 
 If you encounter `OSError: [Errno 12] Cannot allocate memory` during the Dataloader iterator creation (more precisely, during the `fork()`), this is most likely due to the use of the `--precompute_bases` flag. If you cannot add more RAM or swap to your machine, it is recommended to turn off bases precomputation by removing the `--precompute_bases` flag or using `--precompute_bases false`.
+
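
Note on the updated training performance tables in this README diff: the speedup and weak-scaling columns are consistent with simple ratios of the throughput columns. The diff does not state how they were derived, so the sketch below is an assumption. It recomputes them from the updated DGX A100 rows for batch size 240; the `ratio` helper is hypothetical and not part of the repository.

```python
# Throughput values (mol/ms) copied from the updated DGX A100 table, batch size 240.
tf32_1gpu, amp_1gpu = 2.59, 3.23
tf32_8gpu, amp_8gpu = 18.38, 21.42


def ratio(numerator: float, denominator: float) -> float:
    """Plain ratio; assumed (not confirmed by the diff) to match the README columns."""
    return numerator / denominator


speedup_1gpu = ratio(amp_1gpu, tf32_1gpu)        # mixed precision vs. TF32 on 1 GPU
weak_scaling_tf32 = ratio(tf32_8gpu, tf32_1gpu)  # 8 GPUs vs. 1 GPU, TF32
weak_scaling_amp = ratio(amp_8gpu, amp_1gpu)     # 8 GPUs vs. 1 GPU, mixed precision

print(f"speedup: {speedup_1gpu:.2f}x")
print(f"weak scaling TF32: {weak_scaling_tf32:.2f}")
print(f"weak scaling mixed precision: {weak_scaling_amp:.2f}")
```

Recomputing from the rounded table entries lands within about 0.01 of the published columns (1.25x, 7.09, 6.63); the remaining gap presumably comes from rounding the inputs.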
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train.sh b/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train.sh
@@ -8,7 +8,7 @@ AMP=${2:-true}
 CUDA_VISIBLE_DEVICES=0 python -m se3_transformer.runtime.training \
   --amp "$AMP" \
   --batch_size "$BATCH_SIZE" \
-  --epochs 6 \
+  --epochs 16 \
   --use_layer_norm \
   --norm \
   --save_ckpt_path model_qm9.pth \
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train_multi_gpu.sh b/DGLPyTorch/DrugDiscovery/SE3Transformer/scripts/benchmark_train_multi_gpu.sh
@@ -9,7 +9,7 @@ python -m torch.distributed.run --nnodes=1 --nproc_per_node=gpu --max_restarts 0
   se3_transformer.runtime.training \
   --amp "$AMP" \
   --batch_size "$BATCH_SIZE" \
-  --epochs 6 \
+  --epochs 16 \
   --use_layer_norm \
   --norm \
   --save_ckpt_path model_qm9.pth \
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/convolution.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/convolution.py
@@ -113,7 +113,7 @@ def __init__(
             nn.Linear(mid_dim, num_freq * channels_in * channels_out, bias=False)
         ]
 
-        self.net = nn.Sequential(*[m for m in modules if m is not None])
+        self.net = torch.jit.script(nn.Sequential(*[m for m in modules if m is not None]))
 
     def forward(self, features: Tensor) -> Tensor:
         return self.net(features)
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/norm.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/model/layers/norm.py
@@ -32,6 +32,15 @@
 from se3_transformer.model.fiber import Fiber
 
 
+@torch.jit.script
+def clamped_norm(x, clamp: float):
+    return x.norm(p=2, dim=-1, keepdim=True).clamp(min=clamp)
+
+@torch.jit.script
+def rescale(x, norm, new_norm):
+    return x / norm * new_norm
+
+
 class NormSE3(nn.Module):
     """
     Norm-based SE(3)-equivariant nonlinearity.
@@ -63,7 +72,7 @@ def forward(self, features: Dict[str, Tensor], *args, **kwargs) -> Dict[str, Ten
             output = {}
             if hasattr(self, 'group_norm'):
                 # Compute per-degree norms of features
-                norms = [features[str(d)].norm(dim=-1, keepdim=True).clamp(min=self.NORM_CLAMP)
+                norms = [clamped_norm(features[str(d)], self.NORM_CLAMP)
                          for d in self.fiber.degrees]
                 fused_norms = torch.cat(norms, dim=-2)
 
@@ -73,11 +82,11 @@ def forward(self, features: Dict[str, Tensor], *args, **kwargs) -> Dict[str, Ten
 
                 # Scale features to the new norms
                 for norm, new_norm, d in zip(norms, new_norms, self.fiber.degrees):
-                    output[str(d)] = features[str(d)] / norm * new_norm
+                    output[str(d)] = rescale(features[str(d)], norm, new_norm)
             else:
                 for degree, feat in features.items():
-                    norm = feat.norm(dim=-1, keepdim=True).clamp(min=self.NORM_CLAMP)
+                    norm = clamped_norm(feat, self.NORM_CLAMP)
                     new_norm = self.nonlinearity(self.layer_norms[degree](norm.squeeze(-1)).unsqueeze(-1))
-                    output[degree] = new_norm * feat / norm
+                    output[degree] = rescale(feat, norm, new_norm)
 
             return output
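
Note on `rescale`: the helper computes `x / norm * new_norm`, so it is not symmetric in its arguments. It reproduces the earlier inline expression `new_norm * feat / norm` only when the feature tensor is passed first. A minimal sketch on plain Python floats (hypothetical scalar stand-ins for tensors, no TorchScript involved) illustrates the difference:

```python
def rescale(x: float, norm: float, new_norm: float) -> float:
    # Same arithmetic as the scripted helper in the diff, applied to plain floats.
    return x / norm * new_norm


# Hypothetical scalar stand-ins for one feature entry, its norm, and the new norm.
feat, norm, new_norm = 3.0, 5.0, 2.0

inline = new_norm * feat / norm                  # earlier inline expression
feature_first = rescale(feat, norm, new_norm)    # feat / norm * new_norm
new_norm_first = rescale(new_norm, feat, norm)   # new_norm / feat * norm

print(inline, feature_first, new_norm_first)
```

With these values the feature-first call matches the inline expression, while the new-norm-first call divides by the feature instead of by its norm and yields a different number.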
diff --git a/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/callbacks.py b/DGLPyTorch/DrugDiscovery/SE3Transformer/se3_transformer/runtime/callbacks.py
@@ -133,6 +133,7 @@ def __init__(self, logger, batch_size: int, warmup_epochs: int = 1, mode: str =
 
     def on_batch_start(self):
         if self.epoch >= self.warmup_epochs:
+            torch.cuda.synchronize()
             self.timestamps.append(time.time() * 1000.0)
 
     def _log_perf(self):
@@ -153,7 +154,7 @@ def on_fit_end(self):
     def process_performance_stats(self):
         timestamps = np.asarray(self.timestamps)
         deltas = np.diff(timestamps)
-        throughput = (self.batch_size / deltas).mean()
+        throughput = self.batch_size / deltas.mean()
         stats = {
             f"throughput_{self.mode}": throughput,
             f"latency_{self.mode}_mean": deltas.mean(),
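
Note on the throughput change: `(batch_size / deltas).mean()` averages per-batch rates, whereas `batch_size / deltas.mean()` divides by the average batch time. The two differ whenever batch durations vary, and the first is never smaller than the second. A minimal sketch in plain Python (the durations are made-up values, and `statistics.mean` stands in for the NumPy calls used in the callback):

```python
from statistics import mean

batch_size = 240                     # molecules per batch, as in the benchmark tables
deltas_ms = [100.0, 100.0, 400.0]    # hypothetical per-batch durations, one slow batch

# Old formula: mean of per-batch throughputs, which over-weights fast batches.
mean_of_ratios = mean(batch_size / d for d in deltas_ms)

# New formula: batch size over the mean batch duration.
ratio_of_means = batch_size / mean(deltas_ms)

# Reference: total molecules processed divided by total elapsed time.
overall = batch_size * len(deltas_ms) / sum(deltas_ms)

print(mean_of_ratios, ratio_of_means, overall)
```

With equal-sized batches the new formula equals total molecules over total time, which is why it is the more natural definition of average throughput; the old formula reports a higher figure as soon as one batch is slower than the rest.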
