-| FP64 Compute | 5.30 TFLOPs | 8.2 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 34 TFLOPs | 45 TFLOPs |
+| GPU | GP100 (Pascal) | V100 (Volta) | A100 (Ampere) | H100 (Hopper) | HGX B200 (Blackwell) |
+|---------------|----------------|--------------|--------------------------------------|---------------|----------------------|
+| Tensor Cores | N/A | 640 | 432 | 528 | TBD |
+| Float 16 bits | 21.2 TFLOPs | 130 TFLOPs | 312 TFLOPs | 989.4 TFLOPs | 2,200 TFLOPs |
+| Float 32 bits | 10.6 TFLOPs | 16.4 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 494.7 TFLOPs | 1,100 TFLOPs |
+| Float 64 bits | 5.30 TFLOPs | 8.2 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 66.9 TFLOPs | 37 TFLOPs |
-source : https://wccftech.com/nvidia-hopper-h100-gpu-more-powerful-latest-specifications-up-to-67-tflops-fp32-compute/
+sources: [GP100](https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf), [V100](https://images.nvidia.com/content/technologies/volta/pdf/volta-v100-datasheet-update-us-1165301-r5.pdf), [A100](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf), [H100](https://resources.nvidia.com/en-us-hopper-architecture/nvidia-h100-tensor-c)
+and [B200](https://resources.nvidia.com/en-us-blackwell-architecture)
But, now that the hardware is capable of computing a large amount of data per second, the question is: "How can we feed
-these engines with enough data to really take advantage of their computing capability?".
+these engines with enough data to really take advantage of their computing capability?"
Indeed, if we are not able to continuously supply a large amount of data to these Tensor Cores, then, as powerful as
the compute unit may be, it will have to wait for data to be brought to the computing engine (from memory), and
computing performance will be limited by this.
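To get a feel for the orders of magnitude involved, here is a small back-of-the-envelope sketch in Python; the FP16
throughput is taken from the table above, while the ~3 TB/s memory bandwidth is only an approximate, illustrative
assumption, not a vendor specification:

```python
# Back-of-the-envelope sketch: how much data reuse is needed to keep Tensor Cores busy?
peak_fp16_flops = 989.4e12   # H100 FP16 Tensor Core throughput from the table above (FLOP/s)
memory_bandwidth = 3.0e12    # assumed global memory bandwidth in bytes/s (approximate figure)
bytes_per_fp16 = 2

elements_per_second = memory_bandwidth / bytes_per_fp16      # ~1.5e12 FP16 values per second
flops_per_element = peak_fp16_flops / elements_per_second    # ~660 FLOPs per loaded value

print(f"Each value fetched from memory must take part in ~{flops_per_element:.0f} FLOPs")
```

In other words, every value brought in from global memory has to be reused hundreds of times (in registers, caches or
Shared Local Memory) before the compute units run out of work; data movement, rather than raw compute, is usually the
limiting factor.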
-Another question is how a programmer who is not a hardware expert can take advantage of these “very” advanced
-hardware features.
+Another question is how a programmer who is not a hardware expert can take advantage of these very advanced hardware
+features.
One possible answer to this question might be to use Triton.
Triton is "a Python-based programming environment for productively writing custom DNN compute kernels capable of running
at maximum throughput on modern GPU hardware", as described on
@@ -56,22 +58,24 @@ Let's first step back from the concrete implementation of a kernel and look at G
If we consider a simple matrix multiplication: $C += A \times B$, we need to provide Tensor Cores with two input operands
A and B and store the output C.
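To make this concrete, here is a minimal Triton matmul kernel sketch in which each program instance accumulates one
tile of C. It assumes row-major, contiguous matrices whose dimensions are divisible by the block sizes (so no masking),
and all names are illustrative rather than taken from an actual implementation:

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Operands A and B are loaded tile by tile...
        a = tl.load(a_ptr + offs_m[:, None] * K + (k + offs_k)[None, :])
        b = tl.load(b_ptr + (k + offs_k)[:, None] * N + offs_n[None, :])
        # ...and handed to the Tensor Cores through tl.dot, which accumulates C += A x B.
        acc = tl.dot(a, b, acc)
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)
```

The `tl.load`, `tl.dot` and `tl.store` operations in this sketch are exactly the ones whose lowering to the hardware
is discussed in the rest of this post.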
-In classical GPU architecture, the operands and outputs are provided to Tensor Cores (also called Vector Engines in
-Intel GPUs) using registers.
+In classical GPU architectures, the operands and outputs are provided to Tensor Cores (also called Matrix Engines or XMX
+in Intel GPUs) using registers.
Registers are a kind of small and fast memory bank (called a Register File) located just beside the compute engine, as
can be seen in the following diagrams showing selected parts of an Intel GPU architecture.
-
+
*Illustration of an Intel Xe2 GPU Vector engine architecture (simplified)*
-
+
*Illustration of an Intel XeCore architecture (simplified)*
-Basically, the tensor core reads operands A and B from the *register file* and then writes the accumulated output C
-back to the *file register*.
+Basically, the tensor core reads operands A and B from the *Register File* and then writes the accumulated output C
+back to the *Register File*.
However, as we have seen in [Introduction](#gpu-tensor-core-and-data-feeding), Tensor Cores have improved significantly,
-making it possible to compute more data per second. But, this implies that we need to feed Tensor Cores with an
+making it possible to compute more and more data per second. But this implies that we need to feed Tensor Cores with an
increasing amount of data to take advantage of their computing power.
This raises two issues:
@@ -82,7 +86,7 @@ To address the first point, recent GPUs incorporate dedicated engines to load/st
Memory. Therefore, these *Direct Memory Access (DMA)* engines (called *Tensor Memory Accelerator (TMA)* in Nvidia
architectures) make it possible to hide the latency of accessing distant memory.
As for increasing throughput, the common idea is to achieve that by sharing as much data as possible between Streaming
-Processor (SM).
+Multiprocessors (SMs).
Indeed, recent Nvidia architectures (Hopper and later) come with a *Thread Block Cluster* feature that allows
experienced users to reduce the amount of data fetched from distant memory.
In this post, we won't go into more detail about these features and how to take advantage of them, but we can recommend
@@ -94,15 +98,15 @@ and [Thread Block Cluster](https://research.colfax-intl.com/cutlass-tutorial-gem
As mentioned above, *register files* have a limited size and must contain all the *live* variables we need to execute
the user kernel.
So, if the user kernel requires more *live* variables than the number of available registers, the compiler has to resort
-to register spilling. This technique consists in freeing register by pushing its contents into memory and then loading
+to register spilling. This technique consists in freeing a register by pushing its contents into memory and then loading
it back before the contents can be used.
-As one can imagine, these extra data movements back and forth to memory can severely impact computational performance.
+As one can imagine, these extra data movements back and forth to memory can severely impact computational performance.
Although compilers do their best to avoid registers being spilled by optimizing the generated code to reuse available
-registers as much as possible, sometimes, if the amount of *live* variables is too large, the compiler cannot do its
+registers as much as possible, sometimes, if the number of *live* variables is too large, the compiler cannot do its
magic and registers have to be spilled.
However, some improvements and techniques can be considered before relying on low-level compilers for "last mile"
-optisations.
+optimisations.
To this end, the 4th and subsequent generations of Nvidia Tensor Cores can load data (i.e., A and B operands) directly
from the Shared Local Memory (SMEM).
@@ -150,57 +154,58 @@ So, in practice, the compiler will decompose the `tl.load` into a two-step proce
1. From Global Memory to L1 Cache (or SMEM)
2. From L1 Cache (or SMEM) to Registers
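This decomposition is entirely invisible at the Triton source level. In the tiny, illustrative kernel below there is a
single `tl.load`, and it is the compiler that later splits it into the two steps above, differently for each backend:

```python
import triton
import triton.language as tl

@triton.jit
def copy_tile_kernel(src_ptr, dst_ptr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    # A single source-level load; during compilation it becomes:
    #   1. Global Memory -> L1 cache (Intel prefetch) or SMEM (Nvidia),
    #   2. L1 cache / SMEM -> Registers.
    x = tl.load(src_ptr + offs)
    tl.store(dst_ptr + offs, x)
```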
-In Intel architectures (PVC and BMG), a hardware-managed cache, i.e., the "L1 Data cache", is used to bring data closer
-to the compute unit then data evictions are managed by the hardware in case of conflicts, quite similarly to what is
-done for CPUs.
+In Intel architectures (Ponte Vecchio GPU Max (PVC) and Battlemage Arc B580 GPU), a hardware-managed cache, i.e., the
+"L1 Data cache", is used to bring data closer to the compute unit, and data evictions are managed by the hardware in
+case of conflicts, quite similarly to what is done for CPUs.
The first step of our loading process is therefore achieved by a `TritonIntelGPU::PrefetchOp` which prefetches the data
from Global Memory to the L1 Cache; the second step is then carried out by the `Triton::LoadOp` which loads data into
Registers, hopefully from the L1 cache if the data is still available in cache (cache hit).
The diagram below illustrates this process:
-
+
*Intel Backend Memory Semantic (synchronous)*
-Nvidia has chosen to leverage the Share Local Memory (SMEM) instead of the cache. SMEM is indeed a scratch pad memory
+Nvidia has chosen to leverage the Shared Local Memory (SMEM) instead of the cache. SMEM is indeed a scratch pad memory
explicitly managed by the software. Hence, to accommodate the Nvidia backend, we find in the *Triton GPU dialect* some
operations to manage the SMEM, such as `TritonGPU::LocalAllocOp` and `TritonGPU::LocalDeallocOp` to allocate and deallocate
a memory buffer in SMEM, but also `TritonGPU::LocalLoadOp` and `TritonGPU::LocalStoreOp` to handle the data transfers
between SMEM and Registers.
-Consequently, the triton process for loading and storing data (synchronously) in the Nvidia architecture is as follow:
+Consequently, the Triton process for loading and storing data (synchronously) in the Nvidia architecture is as follows:
-
+
*Nvidia Backend Memory Semantic (synchronous)*
-
---
**NOTE**
-It worth noting here, that Nvidia Tensor Core version 4 and later, are special operations that do not require there
-operand to be in Registers but operands can (or have to be) in SMEM for the *mma* operation to take place. Consequently,
-the compiler does not need to explictly load the tensor from SMEM to register, but the *mma* operation mamange its
-operands itself.
+It's worth noting here that the *mma* operations of Nvidia Tensor Cores (4th generation and later) do not require their
+operands to be in Registers: operands can (or have to) reside in SMEM for the *mma* operation to take place.
+Consequently, the compiler does not need to explicitly load the tensor from SMEM to registers; the *mma* operation
+manages its operands itself.
---
### Variable liveness and Register reservation
-We say that a variable is *live* at a given point of a program if the variable contains a value that may be use in the
+We say that a variable is *live* at a given point of a program if the variable contains a value that may be used in the
future.
-In the following example, the variable A is *live* from line 1 to line 7, where the last used of the variable A is
-found.
+In the following example, the variable A is *live* from line 1 to line 7, where the last use of the variable A is found.
As for the variable B, its liveness only spans from line 4 to line 5.
When register assignment is performed during compilation, the compiler attempts to keep A in registers for all its
-livespan.
+lifespan.
So, in our example, if A needs $NumReg_A$ registers to be stored, this means that $NumReg_A$ registers will be reserved
for A across the loop, and thus the compiler needs to fit the variables used between lines 1 and 7 in $N - NumReg_A$
registers, with $N$ being the total number of registers available.
-
+
*Variable liveness simple example*
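Written as a Triton-like Python sketch, such a kernel could look like the following; the line numbers in the comments
mirror the ones used above, and the pointers, shapes and computations are purely hypothetical:

```python
import triton
import triton.language as tl

@triton.jit
def liveness_example(a_ptr, b_ptr, out_ptr, N_ITER: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    A = tl.load(a_ptr + offs)                    # line 1: A becomes *live*
    acc = tl.zeros((BLOCK,), dtype=tl.float32)   # line 2
    for i in range(N_ITER):                      # line 3
        B = tl.load(b_ptr + i * BLOCK + offs)    # line 4: B becomes *live*
        acc += B.to(tl.float32)                  # line 5: last use of B, B dies
        acc *= 2.0                               # line 6
    acc += A.to(tl.float32)                      # line 7: last use of A, A dies here
    tl.store(out_ptr + offs, acc)
```

The registers holding A stay reserved across the whole loop even though A is only read once, after it.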
It is therefore easy to understand that in such a kernel, if the variable A is large and the kernel processing between
-lines 2 and 7 is also register consuming, the compiler may have hard time to allocate registers while avoiding register
-spills.
+lines 2 and 7 is also register-consuming, the compiler may have a hard time allocating registers while avoiding
+register spills.
This is exactly what happens in the widespread case of [FlashAttention version 2](https://arxiv.org/abs/2307.08691).
@@ -252,16 +257,16 @@ def FlashAttention2_forward(Q, K, V):
In the second version of the implementation of the FlashAttention model, the loop order has been reversed to promote
data locality.
As long as there is enough local memory (or registers) to contain all the needed data, this algorithm works fine and
-provide significant performance improvement compared to FlashAttention v1 (in the paper, the authors mention 2x faster
-for the Cutlass implementation and 1.3-1.5√ó faster in Triton on an Nvidia Amper GPU A100).
+provides significant performance improvements compared to FlashAttention v1 (in the paper, the authors mention 2× faster
+for the Cutlass implementation and 1.3-1.5× faster in Triton on an Nvidia Ampere GPU A100).
Deployed on a GPU target, lines 4-10 constitute the computing kernel that is dispatched to a Thread Block/Work-Group
(i.e., an SM/XeCore).
But as you can see, variable Q is loaded before the loop (line 4) and remains *live* across the loop.
The long lifespan of variable Q is even more problematic in the causal variation of the FlashAttention implementation.
The causal variation is defined in the paper as:
-> One common use case of attention is in auto-regressive language modelling, where we
-> need to apply a causal mask to the attention matrix S (i.e., any entry Sùëñ ùëó with ùëó > ùëñ is set to ‚àí‚àû).
+> One common use case of attention is in auto-regressive language modeling, where we
+> need to apply a causal mask to the attention matrix S (i.e., any entry S𝑖 𝑗 with 𝑗 > 𝑖 is set to −∞).
The Triton implementation of FlashAttention v2 with causal mask is as follows:
@@ -362,9 +367,9 @@ another for calculating the data to which the mask does need to be applied).
Our point is that variable `q`, which holds a chunk of the Q matrix, is *live* from the `tl.load(Q)` instruction until
the second loop, where variable `q` is read for the last time.
When the target GPU architecture is able to load its operands directly from SMEM or TMEM (as discussed in the previous
-section), this is less of problem because these memories are larger than the register files.
-But when the target GPU does not have this capability, such as Intel PVC and BMG GPUs, the variable has to reside in
-registers.
+section), this is less of a problem because these memories are larger than the register files.
+But when the target GPU does not have this capability, such as Intel Ponte Vecchio and Battlemage GPUs, the variable has
+to reside in registers.
Thus, registers have to be dedicated to holding this variable throughout the kernel execution.
Consequently, if variable `q` is large, many registers will be reserved for this variable and the register
allocator will be forced to spill registers.
@@ -373,7 +378,7 @@ allocator will be forced to spill registers.
The register allocator's difficulty in assigning registers without spilling comes from the fact that some variables are
*live* for a long period of time, which reduces the number of registers available for other variables.
-Even worse, the high pressure on registers can prevent the compiler for applying other optimisations such as loop
+Even worse, the high pressure on registers can prevent the compiler from applying other optimisations such as loop
unrolling (which requires additional registers).
As a consequence, reducing the liveness of variables, when possible, relaxes the constraints on register allocations
@@ -385,9 +390,10 @@ In the [XPU backend](https://github.com/intel/intel-xpu-backend-for-triton/tree/
added
an [optimization pass](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/third_party/intel/lib/TritonIntelGPUTransforms/ReduceVariableLiveness.cpp)
which aims to reduce variable liveness where possible.
-To this ends, the pass attempts to bring load operations closer to the actual uses of the loaded data.
+To this end, the pass attempts to bring load operations closer to the actual uses of the loaded data.
-
+
*Reduce Variable Liveness pass diagram*
The diagram above shows how the compiler pass works to reduce the liveness of `DotOp` operands.
@@ -397,11 +403,11 @@ This analysis computes the liveness of each variable in the source code. This wi
variables at a specific point in the source code.
In the second step, the pass looks for `DotOp` (i.e., the matrix multiplication operation) in a `For` loop. Currently,
-the pass only considers `DotOp` in `For` loops as achor because it is a resource-consuming operation that is critical to
-the performance of AI kernels.
-But the pass can be extend to other cases in the future.
+the pass only considers `DotOp` in `For` loops as an anchor because it is a resource-consuming operation that is critical
+to the performance of AI kernels.
+But the pass can be extended to other cases in the future.
-The third steps is to retrieve the `loadOp` that loads the `DotOp` operands.
+The third step is to retrieve the `loadOp` that loads the `DotOp` operands.
In brief, the pass walks back the def-chain of the `DotOp` operands and returns the `loadOp` when it is found.
Next, the pass checks if the `loadOp` is eligible to be moved. To be a candidate, a few conditions must be met:
@@ -412,7 +418,7 @@ Next, the pass checks if the `loadOp` is eligible to be moved. To be a candidate
- Empirically, we observe that loading a large amount of data is not the only criterion to determine if moving
  the `loadOp` is needed. The Triton language defines kernels at the block/work-group level, but this group is then
  handled by multiple warps/sub-groups (and threads/work-items). This sub-division into warps has an impact on the way data is
- loaded and on the register assignment policy. As a result, the amount of loaded data must be large, but also, the
+ loaded and on the register assignment policy. As a result, the amount of loaded data must be large, but also the
shape of the data on the dimension that is not split between warps must be large enough for the proposed optimisation
to be relevant.
- The `loadOp` must be outside the loop body, i.e., the operand must be a *live-in* variable of the loop.
@@ -422,28 +428,30 @@ Otherwise, if these conditions are met for at least one of the operands, the pas
care of moving the `loadOp`.
If the `loadOp` has only one user (i.e., the `DotOp`), the load operation is sunk into the loop and a prefetch
operation (`prefetchOp`) is inserted where the `loadOp` was initially located, as shown in the diagram.
-The prefetch operation fetches the data from global memory into the cache. As a result, when the actual load take place,
-the data is loaded from the cache and not from the global memory.
+The prefetch operation fetches the data from global memory into the cache. As a result, when the actual load takes
+place, the data is loaded from the cache and not from the global memory.
The case where the `loadOp` has more than one user is a little more complex, as the `loadOp` cannot simply be sunk into
-the loop, but the pass must ensure that a load operation take place before accessing the data.
+the loop, but the pass must ensure that a load operation takes place before accessing the data.
As loading data from the cache is not expensive, we chose to add another `loadOp` for subsequent uses. Hence, the
liveness of the tensor is still reduced and the low-level compiler (*igc* for Intel targets) is able to perform its
optimisations with fewer constraints on registers.
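To make the effect of the pass more concrete, here is a purely illustrative before/after sketch written at the Triton
source level, even though the pass actually operates on the Triton GPU IR. The kernels, names and shapes are
hypothetical, and the inserted prefetch is only shown as a comment since it has no Triton-language equivalent:

```python
import triton
import triton.language as tl

@triton.jit
def before_pass(q_ptr, k_ptr, out_ptr, N_BLOCKS: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    q = tl.load(q_ptr + offs[:, None] * BLOCK + offs[None, :])  # q becomes *live* here...
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for i in range(N_BLOCKS):
        k = tl.load(k_ptr + i * BLOCK * BLOCK + offs[:, None] * BLOCK + offs[None, :])
        acc = tl.dot(q, k, acc)                                  # ...and stays *live* across the whole loop
    tl.store(out_ptr + offs[:, None] * BLOCK + offs[None, :], acc)

@triton.jit
def after_pass(q_ptr, k_ptr, out_ptr, N_BLOCKS: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    # The pass leaves a prefetch (TritonIntelGPU::PrefetchOp) here so that the tile of Q
    # lands in the L1 cache before the loop starts.
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for i in range(N_BLOCKS):
        q = tl.load(q_ptr + offs[:, None] * BLOCK + offs[None, :])  # load sunk next to its single use
        k = tl.load(k_ptr + i * BLOCK * BLOCK + offs[:, None] * BLOCK + offs[None, :])
        acc = tl.dot(q, k, acc)
    tl.store(out_ptr + offs[:, None] * BLOCK + offs[None, :], acc)
```

In `before_pass`, the registers holding the Q tile are reserved for the entire loop; in `after_pass`, its liveness is
reduced to a single loop iteration, and the data is expected to come cheaply from the L1 cache thanks to the prefetch.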
#### Performance improvement
-We have evaluated the performance of Triton FlashAttention v2 on Intel PVC GPU.
+We have evaluated the performance of Triton FlashAttention v2 on Intel Data Center GPU Max (PVC) GPUs.
The following plots show the normalised performance of the FlashAttention kernel with the *reduce-liveness-pass* enabled
for different input configurations.
-
+
*FlashAttention v2 Normalized performance PVC1100*
-
+
*FlashAttention v2 Normalized performance PVC1550*
+The testbed used for these evaluations and a disclaimer can be found [at the bottom](#disclaimer) of this blog post.
+
We can see that the pass has improved the performance for several configurations on all the targets evaluated by more
-than 5%, and up to 10% for a some of them.
+than 5%, and up to 10% for some of them.
As expected, the input configurations impacted by this optimisation are those with:
- causal mask applied to the computed data
@@ -463,16 +471,15 @@ We also exposed that low-level compilers may lack some knowledge about the sourc
generating optimised code.
In our use case, the *igc* compiler tried to assign registers based on the lifespan of the variables without knowing
that the lifespan of some variables could be reduced to avoid register spilling.
-Consequently, we take advantage of the progressive lowering of MLIR-based compiler to add an optimisation pass to the
+Consequently, we take advantage of the progressive lowering of MLIR-based compilers to add an optimisation pass to the
Intel XPU backend of the Triton compiler.
This pass aims to reduce the liveness of variables under certain conditions.
-As a result, the performance of FlashAttention on PVC GPUs has been improved by up to 10% for certain input
-configurations.
+As a result, the performance of FlashAttention could be improved by up to 10% for certain input configurations.
A limitation of this pass is that we assume the data to be loaded is available in the L1 cache, so the load operations
are cheap and can be easily moved around in the code.
However, this might not be the case if cache conflicts occur and the data is evicted from the cache before being
-loaded into registers. This is likely to happen for GPU with small cache. If this happen, the load operation becomes
+loaded into registers. This is likely to happen for GPUs with a small cache. If this happens, the load operation becomes
expensive and sinking a `loadOp` inside the loop body is far from a good idea.
A future extension of the pass could therefore consider first loading the data into the SMEM, which is explicitly
managed by the software, and then loading the data from the SMEM into registers instead of relying on the cache.