
Commit 811056b

l-batrkazants and Roman Kazantsev authored
Add XAttention documentation (#3010)
## Description

CVS-173857

Co-authored-by: Roman Kazantsev <roman.kazantsev@intel.com>
1 parent dc1c2eb commit 811056b

File tree

4 files changed: +24 -3 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ All scenarios are run on top of OpenVINO Runtime that supports inference on CPU,
OpenVINO™ GenAI library provides a transparent way to use state-of-the-art generation optimizations:
- Speculative decoding that employs two models of different sizes and uses the large model to periodically correct the results of the small model. See [here](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) for more detailed overview
- KVCache token eviction algorithm that reduces the size of the KVCache by pruning less impacting tokens.
+- Sparse attention, which accelerates prefill by attending only to the most important regions of the attention matrix. OpenVINO GenAI currently supports two modes: Tri-shape and XAttention. See [here](https://openvinotoolkit.github.io/openvino.genai/docs/concepts/optimization-techniques/sparse-attention-prefill) for more details.

Additionally, OpenVINO™ GenAI library implements a continuous batching approach to use OpenVINO within LLM serving. The continuous batching library could be used in LLM serving frameworks and supports the following features:
- Prefix caching that caches fragments of previous generation requests and corresponding KVCache entries internally and uses them in case of repeated query.

site/docs/concepts/optimization-techniques/sparse-attention-prefill.md

Lines changed: 19 additions & 2 deletions
@@ -42,11 +42,28 @@ The prompt processing occurs as usual until at least two KV cache blocks have be
After that, for the next prompt chunks only the first and the last/second-last blocks processed will be visible as KV cache contents, effectively introducing sparsity in the attention computation for the rest of the KV cache "body" (`t = 2-4`).

Upon reaching the tail of the prompt the KV cache for the entire prompt is used in attention again, effectively switching back from the sparse attention mode to "dense" attention (`t = 5`).
-Apart from improving the generation accuracy, this also makes it possible to effectively combine the tri-shape sparse prefill algorithm with the cache eviction algorithm, which relies on the model having "seen" the entire prompt KV cache when processing the last tokens of the prompt. The "dense attention" portion of the prompt can be configured using the `SparseAttentionConfig.num_retained_recent_tokens_in_cache` field.
+Apart from improving the generation accuracy, this also makes it possible to effectively combine the tri-shape sparse prefill algorithm with the cache eviction algorithm, which relies on the model having "seen" the entire prompt KV cache when processing the last tokens of the prompt. The "dense attention" portion of the prompt can be configured using the `SparseAttentionConfig.num_last_dense_tokens_in_prefill` field.

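An editorial aside, not part of the diff: by analogy with the XAttention snippet below, tri-shape prefill would be enabled through the same `SchedulerConfig` fields. A minimal sketch, assuming the `SparseAttentionMode.TRISHAPE` enum value and the `openvino_genai` import path; the numeric value is illustrative only:

```python
# Hedged sketch: enable tri-shape sparse prefill with an explicit dense tail.
from openvino_genai import SchedulerConfig, SparseAttentionConfig, SparseAttentionMode  # assumed import path

cb_config = SchedulerConfig(
    use_sparse_attention=True,
    sparse_attention_config=SparseAttentionConfig(
        mode=SparseAttentionMode.TRISHAPE,        # assumed enum value for the tri-shape mode
        num_last_dense_tokens_in_prefill=1024,    # illustrative value, not a recommended default
    ),
)
```
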
### XAttention
-TBA
+For the XAttention algorithm, the prefill computation is accelerated by selectively attending only to the most important regions of the attention matrix, determined dynamically through antidiagonal-based importance estimation. During the prefill stage, each query block attends only to the subset of key blocks whose cumulative estimated attention mass exceeds a predefined threshold, while the rest of the KV cache blocks are excluded from the attention computation.

To enable XAttention with default settings, select `SparseAttentionMode.XATTENTION` in `SparseAttentionConfig.mode` within the `SchedulerConfig` and set `use_sparse_attention=True`:

```python
cb_config = SchedulerConfig(
    use_sparse_attention=True,
    sparse_attention_config=SparseAttentionConfig(
        mode=SparseAttentionMode.XATTENTION
    )
)
```
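
For context, a minimal end-to-end sketch of how such a config could be wired into a pipeline. This is an editorial illustration rather than part of this commit; the `openvino_genai` import path, the `scheduler_config` keyword of `LLMPipeline`, and the model directory are assumptions:

```python
import openvino_genai as ov_genai

# Assumes SparseAttentionConfig / SparseAttentionMode are exported at the package top level.
cb_config = ov_genai.SchedulerConfig(
    use_sparse_attention=True,
    sparse_attention_config=ov_genai.SparseAttentionConfig(
        mode=ov_genai.SparseAttentionMode.XATTENTION
    ),
)

# "ov_model_dir" is a hypothetical directory containing an OpenVINO-converted LLM.
pipe = ov_genai.LLMPipeline("ov_model_dir", "CPU", scheduler_config=cb_config)
print(pipe.generate("Summarize this long document: ...", max_new_tokens=128))
```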

The importance estimation procedure consists of two stages. In the first stage, using stride-based reshaping, the query and key tensors are permuted along antidiagonal patterns, with the stride value determined by the `SparseAttentionConfig.xattention_stride` parameter. The reshaped tensors are then used to compute a coarse estimate of the attention mass per block, with the block size defined by `SparseAttentionConfig.xattention_block_size`. The attention values within each block are summed to produce an importance score that represents the approximate total attention mass associated with that block. In the second stage, for each query block, the corresponding key blocks are sorted in descending order of their estimated attention mass. The algorithm then identifies the minimal subset of blocks whose cumulative antidiagonal attention exceeds the predefined threshold `SparseAttentionConfig.xattention_threshold`. The block selection process always retains the diagonal blocks (corresponding to the most recently processed query positions) as well as the least recent KV cache block.
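
The two-stage procedure can be illustrated with a small NumPy sketch. This is an editorial illustration, not library code: the strided subsampling below is a simplified stand-in for the antidiagonal reshaping, and the function name and signature are hypothetical.

```python
import numpy as np

def select_key_blocks(q, k, block_size, stride, threshold):
    """Toy sketch of XAttention-style block selection for a single attention head.

    q: (n_q, d) query tokens; k: (n_k, d) cached key tokens. Assumes n_q and n_k are
    multiples of block_size and block_size is a multiple of stride. Returns a boolean
    (n_q_blocks, n_k_blocks) mask of key blocks to keep in the attention computation.
    """
    d = q.shape[1]
    n_qb, n_kb = q.shape[0] // block_size, k.shape[0] // block_size

    # Stage 1: strided subsampling inside each block (a stand-in for the antidiagonal
    # reshaping), followed by a coarse block-level attention-mass estimate.
    qs = q.reshape(n_qb, block_size, d)[:, ::stride, :]
    ks = k.reshape(n_kb, block_size, d)[:, ::stride, :]
    logits = np.einsum("qsd,ktd->qkst", qs, ks) / np.sqrt(d)
    weights = np.exp(logits - logits.max())        # unnormalized attention weights
    block_mass = weights.sum(axis=(2, 3))          # (n_qb, n_kb) importance scores

    # Stage 2: per query block, keep the smallest set of key blocks whose
    # descending-sorted cumulative mass exceeds `threshold` of the total, and always
    # retain the diagonal block and the least recent KV cache block.
    mask = np.zeros((n_qb, n_kb), dtype=bool)
    for qb in range(n_qb):
        order = np.argsort(block_mass[qb])[::-1]
        cum = np.cumsum(block_mass[qb][order]) / block_mass[qb].sum()
        mask[qb, order[: np.searchsorted(cum, threshold) + 1]] = True
        mask[qb, 0] = True                         # least recent KV cache block
        mask[qb, min(qb, n_kb - 1)] = True         # diagonal (most recent) block
    return mask
```

The sketch only mirrors the selection logic described above; the actual selection operates on the KV cache blocks managed by the continuous batching scheduler.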

![XAttention sparse prefill illustrated](./../../../static/img/xattention.svg)

The picture above illustrates the XAttention algorithm in more detail. For simplicity, it is presumed that the prompt occupies 8 full KV cache blocks and is processed within 5 chunks. The `xattention_block_size` corresponds to one HW-dependent block of tokens.

The prompt processing occurs as usual until at least two KV cache blocks have been completely filled (`t = 0, 1`). Once the block-level importance scores have been computed (`t = 2-4`), only the subset of KV blocks with cumulative attention mass exceeding the `xattention_threshold` is retained for attention computation, effectively introducing sparsity in the attention computation.

Upon reaching the tail of the prompt, the KV cache corresponding to the entire prompt becomes visible again, reverting to dense attention mode (`t = 5`). This transition ensures that the model attends to the complete prompt context before entering the generation stage. Similar to the tri-shape algorithm, the final dense portion of the prefill can be configured using the `SparseAttentionConfig.num_last_dense_tokens_in_prefill` field. Due to the block-wise cache organization and scheduler chunking, the actual number of prompt tokens processed with dense attention may slightly exceed the specified value, potentially extending across a full block or subsequence chunk depending on the hardware configuration.
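
Pulling these knobs together, a hedged configuration sketch; the field names are taken from the text above, while the numeric values are purely illustrative and not recommended defaults:

```python
from openvino_genai import SchedulerConfig, SparseAttentionConfig, SparseAttentionMode  # assumed import path

cb_config = SchedulerConfig(
    use_sparse_attention=True,
    sparse_attention_config=SparseAttentionConfig(
        mode=SparseAttentionMode.XATTENTION,
        xattention_block_size=128,               # illustrative block size for importance scoring
        xattention_stride=8,                     # illustrative antidiagonal stride
        xattention_threshold=0.9,                # keep blocks until 90% of estimated mass is covered
        num_last_dense_tokens_in_prefill=512,    # dense tail; may be rounded up to a block/chunk
    ),
)
```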

site/static/img/trishape.svg

Lines changed: 1 addition & 1 deletion

site/static/img/xattention.svg

Lines changed: 3 additions & 0 deletions
