@@ -169,7 +166,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
@@ -371,7 +368,7 @@ The [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/
### Benchmarking
-The following section shows how to run benchmarks measuring the model performance in training and inference modes.
+The following section shows how to run benchmarks measuring the model performance in training and inference modes. Note that the first three steps of each epoch are excluded from the throughput and latency calculations, because nvFuser performs its optimizations during the third step of the first epoch, causing a multi-second pause.
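As a rough sketch of how such a warm-up exclusion might be implemented (the `measure_throughput` helper and `step_fn` callback are illustrative, not part of this repository):

```python
import time

WARMUP_STEPS = 3  # nvFuser optimizes during these steps, causing a long pause

def measure_throughput(batches, step_fn):
    """Average items/s over an epoch, excluding the first WARMUP_STEPS steps."""
    total_items, total_time = 0, 0.0
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        step_fn(batch)                     # one training or inference step
        elapsed = time.perf_counter() - start
        if i >= WARMUP_STEPS:              # skip the warm-up iterations
            total_items += len(batch)
            total_time += elapsed
    return total_items / total_time
```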
#### Training performance benchmark
@@ -390,24 +387,24 @@ We conducted an extensive hyperparameter search along with stability tests. The
##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs.
| Dataset | GPUs | Batch size / GPU | Accuracy - TF32 | Accuracy - mixed precision | Time to train - TF32 | Time to train - mixed precision | Time to train speedup (TF32 to mixed precision)
##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 (8x V100 16GB) GPUs.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 (8x V100 16GB) GPUs.
| Dataset | GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 | Time to train - mixed precision | Time to train speedup (FP32 to mixed precision)
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -442,14 +439,14 @@ The performance metrics used were items per second.
##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
-Our results were obtained by running the `train.sh` training script in the [PyTorch 21.06 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
+Our results were obtained by running the `train.sh` training script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 (8x V100 16GB) GPUs. Performance numbers (in items/images per second) were averaged over an entire training epoch.
@@ -463,39 +460,44 @@ The performance metrics used were items per second.
##### Inference performance: NVIDIA DGX A100
-Our results were obtained by running the `inference.py` script in the [PyTorch 21.12 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX A100. Throughput is measured in items per second and latency is measured in milliseconds.
+Our results were obtained by running the `inference.py` script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX A100. Throughput is measured in items per second and latency is measured in milliseconds.
To benchmark the inference performance on a specific batch size and dataset, run the `inference.py` script.
-Our results were obtained by running the `inference.py` script in the [PyTorch 21.12 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 V100. Throughput is measured in items per second and latency is measured in milliseconds.
+Our results were obtained by running the `inference.py` script in the [PyTorch 22.11 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) on NVIDIA DGX-1 V100. Throughput is measured in items per second and latency is measured in milliseconds.
To benchmark the inference performance on a specific batch size and dataset, run the `inference.py` script.
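For reference, a minimal sketch of how per-batch timings can be reduced to these two metrics (throughput in items per second, latency in milliseconds); the `benchmark` function and its arguments are illustrative, not the actual `inference.py` internals:

```python
import time
import numpy as np
import torch

def benchmark(model_fn, batches, warmup=3):
    """Return (items/s, latency percentiles in ms), skipping warm-up steps."""
    latencies_ms, items = [], 0
    for i, batch in enumerate(batches):
        torch.cuda.synchronize()           # flush pending GPU work before timing
        start = time.perf_counter()
        model_fn(batch)                    # one inference call
        torch.cuda.synchronize()           # wait for the GPU to finish
        elapsed = time.perf_counter() - start
        if i >= warmup:                    # exclude warm-up iterations (see above)
            latencies_ms.append(elapsed * 1000.0)
            items += len(batch)
    throughput = items / (sum(latencies_ms) / 1000.0)
    return throughput, np.percentile(latencies_ms, [50, 90, 95, 99])
```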
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to https://developer.nvidia.com/deep-learning-performance-training-inference.
### Changelog
+March 2023
+- 23.01 Container Update
+- Switch from NVIDIA Apex AMP and NVIDIA Apex FusedLayerNorm to Native PyTorch AMP and Native PyTorch LayerNorm
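The switch above amounts to replacing the Apex utilities with PyTorch's built-in equivalents. A minimal sketch of the native pattern (the toy model, shapes, and loop here are placeholders, not this repository's training code):

```python
import torch
from torch import nn

# Toy model using native nn.LayerNorm (instead of Apex FusedLayerNorm).
model = nn.Sequential(nn.Linear(512, 512), nn.LayerNorm(512)).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()       # native AMP loss scaler (instead of Apex AMP)

for _ in range(10):                        # stand-in for a real DataLoader loop
    batch = torch.randn(32, 512, device="cuda")
    target = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass in mixed precision
        loss = nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()          # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                 # unscale gradients, then optimizer step
    scaler.update()                        # adjust the scale factor for the next step
```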