Commit dfa9f12

Update index.md
1 parent b985c34 commit dfa9f12

File tree

1 file changed (+3, -3)

  • docs/source/user-guide/sparse-attention/index.md

docs/source/user-guide/sparse-attention/index.md

Lines changed: 3 additions & 3 deletions
@@ -2,10 +2,10 @@
## Motivations

Attention mechanisms, especially in LLMs, are often the latency bottleneck during inference due to their computational complexity. Despite their importance in capturing contextual relationships, traditional attention requires processing all token interactions, leading to significant delays.

- ![Attention Overhead](/docs/source/_static/images/attention_overhead.png)
+ ![Attention Overhead](../../_static/images/attention_overhead.png)

Researchers have found that attention in LLMs is highly sparse:
- ![Attention Sparsity](/docs/source/_static/images/attention_sparsity.png)
+ ![Attention Sparsity](../../_static/images/attention_sparsity.png)

This motivates researchers to actively develop sparse attention algorithms that address the latency issue. These algorithms aim to reduce the number of token interactions by focusing only on the most relevant parts of the input, thereby lowering computation and memory requirements.

While these algorithms are promising, the gap between theoretical prototypes and practical implementations in inference frameworks remains a significant challenge.
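A common prototype in this family is per-query top-k selection: score every cached key, keep only the k highest-scoring ones, and run softmax attention over that subset. The sketch below is a minimal NumPy illustration of that general idea; the function `topk_sparse_attention` is invented for this example and is not UCM's algorithm or part of this commit's diff.

```python
# Minimal sketch of per-query top-k sparse attention (illustrative only).
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """q: (n_q, d); k, v: (n_kv, d). Each query attends to its top_k keys only."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])                     # (n_q, n_kv) similarities
    keep = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]   # indices of the top_k keys
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, keep, 0.0, axis=-1)                   # 0 for kept keys, -inf otherwise
    masked = scores + mask
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # softmax over the kept keys
    return weights @ v                                            # (n_q, d) attention output

# Toy usage: one decoding query against 8 cached tokens, keeping only 2 of them.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(1, 16)), rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(topk_sparse_attention(q, k, v, top_k=2).shape)              # (1, 16)
```

Only top_k of the n_kv cached key/value rows contribute to each output, which is where the compute and memory savings come from; this toy version still materializes the full score matrix, which a practical kernel would avoid.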
@@ -19,7 +19,7 @@ By utilizing UCM, researchers can efficiently implement rapid prototyping and te
## Architecture

### Overview

The core concept of our UCMSparse attention framework is to offload the complete Key-Value (KV) cache to a dedicated KV cache storage. We then identify the crucial KV pairs relevant to the current context, as determined by our sparse attention algorithms, and selectively load only the necessary portions of the KV cache from storage into High Bandwidth Memory (HBM). This design significantly reduces the HBM footprint while accelerating generation speed.
- ![Sparse Attn Arch](/docs/source/_static/images/sparse_attn_arch.png)
+ ![Sparse Attn Arch](../../_static/images/sparse_attn_arch.png)
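The flow described above can be pictured with a small, self-contained sketch. Everything in it is hypothetical (the `KVStore` class, the max-dot-product relevance score, the block layout); it only shows the shape of the offload-then-selectively-reload pattern, not UCM's actual interfaces.

```python
# Hypothetical sketch of offloading KV blocks and reloading only the relevant ones.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class KVStore:
    """Stands in for the dedicated KV cache storage tier outside HBM."""
    blocks: dict = field(default_factory=dict)

    def put(self, block_id, k, v):
        self.blocks[block_id] = (k, v)                 # offload a KV block out of HBM

    def get(self, block_ids):
        ks, vs = zip(*(self.blocks[b] for b in block_ids))
        return np.concatenate(ks), np.concatenate(vs)  # reload only the selected blocks

def select_relevant_blocks(query, store, top_k):
    """Toy relevance score: max dot product between the query and each block's keys."""
    scores = {b: float((query @ k.T).max()) for b, (k, _) in store.blocks.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Decode step: the full cache lives in the store; only a few blocks enter HBM.
rng = np.random.default_rng(0)
store = KVStore()
for block_id in range(16):
    store.put(block_id, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))

query = rng.normal(size=(1, 16))
hot_blocks = select_relevant_blocks(query, store, top_k=2)
k_hbm, v_hbm = store.get(hot_blocks)                   # 2 of 16 blocks, i.e. ~1/8 of the cache
print(hot_blocks, k_hbm.shape, v_hbm.shape)
```

In UCMSparse the relevance decision is made by the sparse attention algorithms and the cache lives in a dedicated storage tier rather than a Python dict, but the control flow follows the same pattern: keep the full KV cache off HBM and move only what the algorithm selects.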
### Key Concepts
