
Commit b985c34

Update index.md
1 parent 8fd66a4 commit b985c34

docs/source/user-guide/sparse-attention/index.md

Lines changed: 4 additions & 4 deletions
@@ -2,10 +2,10 @@
## Motivations
Attention mechanisms, especially in LLMs, are often the latency bottleneck during inference due to their computational complexity. Despite their importance in capturing contextual relationships, traditional attention processes all token interactions, leading to significant delays.

-![Attention Overhead](/_static/images/attention_overhead.png)
+![Attention Overhead](/docs/source/_static/images/attention_overhead.png)

Researchers have found that attention in LLMs is highly dispersed:
-![Attention Sparsity](/_static/images/attention_sparsity.png)
+![Attention Sparsity](/docs/source/_static/images/attention_sparsity.png)

This motivates researchers to actively develop sparse attention algorithms that address the latency issue. These algorithms aim to reduce the number of token interactions by attending only to the most relevant parts of the input, thereby lowering computation and memory requirements.
While promising, the gap between theoretical prototypes and practical implementations in inference frameworks remains a significant challenge.
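
The documentation text above explains that sparse attention algorithms cut cost by attending only to the most relevant tokens. As a rough, framework-agnostic illustration of that top-k idea (the function name, scoring rule, and shapes below are assumptions for illustration, not UCM's implementation):

```python
# Minimal sketch (not UCM code): restrict attention to the top-k most relevant
# cached tokens instead of the full context.
import numpy as np

def topk_sparse_attention(q, K, V, k=8):
    """q: (d,) decode query; K, V: (n, d) cached keys/values; k: token budget."""
    scores = K @ q / np.sqrt(q.shape[-1])   # relevance score of every cached token
    keep = np.argsort(scores)[-k:]          # indices of the k most relevant tokens
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                            # softmax over the selected subset only
    return w @ V[keep]                      # attend to k tokens instead of n

# Example: 1024 cached tokens, only 8 of them enter the attention computation.
rng = np.random.default_rng(0)
d, n = 64, 1024
out = topk_sparse_attention(rng.standard_normal(d),
                            rng.standard_normal((n, d)),
                            rng.standard_normal((n, d)))
print(out.shape)  # (64,)
```

Only the selected subset participates in the softmax, so compute and memory scale with the budget k rather than the full context length.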
@@ -19,7 +19,7 @@ By utilizing UCM, researchers can efficiently implement rapid prototyping and te
## Architecture
### Overview
The core concept of our UCMSparse attention framework is to offload the complete Key-Value (KV) cache to a dedicated KV cache storage. We then identify the crucial KV pairs relevant to the current context, as determined by our sparse attention algorithms, and selectively load only the necessary portions of the KV cache from storage into High Bandwidth Memory (HBM). This design significantly reduces the HBM footprint while accelerating generation.
-![Sparse Attn Arch](/_static/images/sparse_attn_arch.png)
+![Sparse Attn Arch](/docs/source/_static/images/sparse_attn_arch.png)


### Key Concepts
@@ -41,4 +41,4 @@ esa
gsa
kvcomp
kvstar
-:::
+:::
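
To make the Overview above concrete, here is a minimal sketch of the offload-and-selective-load flow it describes: the full KV cache sits in an external store, a cheap per-block summary stays in HBM, and only the blocks ranked relevant for the current query are loaded back. Every class, function, and parameter name below is a hypothetical illustration, not UCM's actual API:

```python
# Sketch of offload -> rank -> selective load for a blocked KV cache.
import numpy as np

class ExternalKVStore:
    """Stand-in for a dedicated KV cache storage tier outside HBM."""
    def __init__(self):
        self.blocks = {}                          # block_id -> (K_block, V_block)

    def put(self, block_id, K_block, V_block):
        self.blocks[block_id] = (K_block, V_block)

    def get(self, block_ids):
        return [self.blocks[b] for b in block_ids]

def select_blocks(query, block_summaries, budget):
    """Rank offloaded blocks by a cheap relevance proxy (mean key vs. query)."""
    scores = {b: float(summary @ query) for b, summary in block_summaries.items()}
    return sorted(scores, key=scores.get, reverse=True)[:budget]

# Offload: push every KV block out of HBM, keeping only a small summary per block.
store, summaries = ExternalKVStore(), {}
rng = np.random.default_rng(0)
d, block_tokens = 64, 16
for block_id in range(32):
    K_block = rng.standard_normal((block_tokens, d))
    V_block = rng.standard_normal((block_tokens, d))
    store.put(block_id, K_block, V_block)
    summaries[block_id] = K_block.mean(axis=0)    # cheap per-block summary kept in HBM

# Decode step: load only the few blocks the sparse algorithm deems relevant.
query = rng.standard_normal(d)
needed = select_blocks(query, summaries, budget=4)
hbm_resident = store.get(needed)                  # 4 of 32 blocks brought back into HBM
print(needed, len(hbm_resident))
```

A production implementation would additionally overlap storage reads with computation and manage HBM eviction; the sketch only shows the offload, rank, and fetch control flow.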
