Using CUDA Graphs in PyTorch C++ API
====================================

**Translator**: `hyoyoung <https://github.com/hyoyoung>`_

.. note::
   |edit| View and edit this tutorial in `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs.rst>`__. The full source code is available on `GitHub <https://github.com/pytorch/tutorials/blob/main/advanced_source/cpp_cuda_graphs>`__.

Prerequisites:

- `Using the PyTorch C++ Frontend <../advanced_source/cpp_frontend.html>`__
- `CUDA semantics <https://pytorch.org/docs/master/notes/cuda.html>`__
- PyTorch 2.0 or later
- CUDA 11 or later

NVIDIA's CUDA Graphs have been a part of the CUDA Toolkit since the
release of `version 10 <https://developer.nvidia.com/blog/cuda-graphs/>`_.
They are capable of greatly reducing CPU overhead, thereby increasing the
performance of applications.

In this tutorial, we will be focusing on using CUDA Graphs with the `C++
frontend of PyTorch <https://tutorials.pytorch.kr/advanced/cpp_frontend.html>`_.
The C++ frontend is mostly utilized in production and deployment applications, which
are an important part of PyTorch use cases. Since `their first appearance
<https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/>`_,
CUDA Graphs have won users' and developers' hearts for being a very performant
and, at the same time, simple-to-use tool. In fact, CUDA Graphs are used by default
in ``torch.compile`` of PyTorch 2.0 to boost the productivity of training and inference.

We would like to demonstrate CUDA Graphs usage on PyTorch's `MNIST
example <https://github.com/pytorch/examples/tree/main/cpp/mnist>`_.
The usage of CUDA Graphs in LibTorch (the C++ frontend) is very similar to its
`Python counterpart <https://pytorch.org/docs/main/notes/cuda.html#cuda-graphs>`_,
but with some differences in syntax and functionality.

Getting Started
---------------

The main training loop consists of several steps, as depicted in the
following code chunk:

.. code-block:: cpp

     optimizer.step();
   }

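The excerpt above shows only the end of that loop. As a rough sketch, under the assumption that ``model``, ``optimizer``, ``device``, and ``data_loader`` are set up as in the MNIST example, one iteration might look like this:

.. code-block:: cpp

   // Sketch only; model, optimizer, device, and data_loader are assumed
   // to come from the MNIST example setup (requires #include <torch/torch.h>).
   for (const auto& batch : *data_loader) {
     auto data = batch.data.to(device);
     auto targets = batch.target.to(device);

     optimizer.zero_grad();                         // clear old gradients
     auto output = model.forward(data);             // forward pass
     auto loss = torch::nll_loss(output, targets);  // loss computation
     loss.backward();                               // backward pass
     optimizer.step();                              // weight update
   }
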
The example above includes a forward pass, a backward pass, and weight updates.

In this tutorial, we will be applying CUDA Graphs to all the compute steps through whole-network
graph capture. But before doing so, we need to slightly modify the source code: we need to
preallocate tensors so that they can be reused in the main training loop. Here is an example
implementation:

.. code-block:: cpp

     training_step(model, optimizer, data, targets, output, loss);
   }

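A rough sketch of the preallocation idea, not the tutorial's exact code (``kTrainBatchSize`` and the 1x28x28 input shape are assumptions taken from the MNIST example, and ``device`` and ``data_loader`` come from its setup), could look like this:

.. code-block:: cpp

   // Preallocate static tensors once, outside the training loop.
   const int64_t kTrainBatchSize = 64;
   auto float_opts = torch::TensorOptions().device(device).dtype(torch::kFloat);
   auto long_opts = torch::TensorOptions().device(device).dtype(torch::kLong);

   torch::Tensor data = torch::zeros({kTrainBatchSize, 1, 28, 28}, float_opts);
   torch::Tensor targets = torch::zeros({kTrainBatchSize}, long_opts);
   torch::Tensor output = torch::zeros({1}, float_opts);
   torch::Tensor loss = torch::zeros({1}, float_opts);

   for (const auto& batch : *data_loader) {
     // Copy each new batch into the preallocated tensors instead of
     // allocating fresh ones every iteration.
     data.copy_(batch.data);
     targets.copy_(batch.target);
     training_step(model, optimizer, data, targets, output, loss);
   }

This matters for graph capture: a captured CUDA graph replays work on fixed memory addresses, so the inputs it reads must live in tensors whose storage does not change between iterations.
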
Here, ``training_step`` simply consists of forward and backward passes with the corresponding optimizer calls:

.. code-block:: cpp

     optimizer.step();
   }

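A sketch of what such a helper might look like, assuming the ``Net`` module type from the MNIST example (the exact signature here is an assumption):

.. code-block:: cpp

   #include <torch/torch.h>

   // output and loss are passed by reference so the caller keeps
   // handles to the results of each step.
   void training_step(
       Net& model,
       torch::optim::Optimizer& optimizer,
       torch::Tensor& data,
       torch::Tensor& targets,
       torch::Tensor& output,
       torch::Tensor& loss) {
     optimizer.zero_grad();
     output = model.forward(data);
     loss = torch::nll_loss(output, targets);
     loss.backward();
     optimizer.step();
   }
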
PyTorch's CUDA Graphs API relies on stream capture, which in our case would be used like this:

.. code-block:: cpp

   training_step(model, optimizer, data, targets, output, loss);
   graph.capture_end();

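A minimal sketch of the surrounding capture code, assuming the side stream is taken from the stream pool, might be:

.. code-block:: cpp

   // Needs #include <ATen/cuda/CUDAGraph.h> and #include <ATen/cuda/CUDAContext.h>.
   // Graph capture must run on a non-default (side) stream.
   at::cuda::CUDAStream capture_stream = at::cuda::getStreamFromPool();
   at::cuda::setCurrentCUDAStream(capture_stream);

   at::cuda::CUDAGraph graph;
   graph.capture_begin();  // start recording work issued on the current stream
   training_step(model, optimizer, data, targets, output, loss);
   graph.capture_end();    // stop recording; the graph can now be replayed
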
Before the actual graph capture, it is important to run several warm-up iterations on a side stream to
prepare the CUDA cache as well as the CUDA libraries (like cuBLAS and cuDNN) that will be used during
the training:

.. code-block:: cpp

     training_step(model, optimizer, data, targets, output, loss);
   }

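A sketch of such a warm-up phase (the number of iterations here is an arbitrary choice, not a value from the tutorial):

.. code-block:: cpp

   // Run a few eager-mode iterations on a side stream before capturing,
   // so that CUDA caches and libraries such as cuBLAS/cuDNN are initialized.
   at::cuda::CUDAStream warmup_stream = at::cuda::getStreamFromPool();
   at::cuda::setCurrentCUDAStream(warmup_stream);

   constexpr int kNumWarmupIters = 7;
   for (int iter = 0; iter < kNumWarmupIters; ++iter) {
     training_step(model, optimizer, data, targets, output, loss);
   }
   torch::cuda::synchronize();  // ensure warm-up work has finished before capture
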
After a successful graph capture, we can replace the ``training_step(model, optimizer, data, targets, output, loss);``
call with ``graph.replay();`` to perform the training step.

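Concretely, the graphed main loop could then be sketched as follows, again assuming the preallocated tensors and ``data_loader`` from above:

.. code-block:: cpp

   for (const auto& batch : *data_loader) {
     // Refresh the static input tensors that the captured graph reads ...
     data.copy_(batch.data);
     targets.copy_(batch.target);
     // ... then replay the captured forward/backward/optimizer work.
     graph.replay();
   }
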
Training Results
----------------

Taking the code for a spin, we can see the following output from ordinary, non-graphed training:

.. code-block:: shell

   user 0m44.018s
   sys  0m1.116s

While the training with the CUDA Graph produces the following output:

.. code-block:: shell

   user 0m7.048s
   sys  0m0.619s

Conclusion
----------

As we can see, just by applying a CUDA Graph on the `MNIST example
<https://github.com/pytorch/examples/tree/main/cpp/mnist>`_ we were able to gain a more than
six-fold speedup in training. Such a large performance improvement was achievable due to
the small model size: for larger models with heavy GPU usage, the CPU overhead is less impactful,
so the improvement will be smaller. Nevertheless, it is always advantageous to use CUDA Graphs to
get the most out of the GPU's performance.