docs/source/user-guide/sparse-attention/gsa.md
1 addition & 1 deletion (+1 −1)
@@ -29,7 +29,7 @@ GSA (Geometric Sparse Attention) simultaneously tackles the high computational c
 
 
 ## 🔥 Key Results
-In both performance and accuracy evaluations, we employed the DeepSeek-R1-Distill-Qwen-32B model deployed on two H20 GPUs.
+In both performance and accuracy evaluations, we deployed the DeepSeek-R1-Distill-Qwen-32B model on two H20 GPUs.
 ## 🏆 Performance Highlights
 ### End-to-End Performance with 80 % Prefix-Cache Hit Ratio
 Below are the end-to-end throughput results for inference scenarios without KVCache offloading. PC Baseline refers to the full attention method with an 80% prefix cache hit rate. The GSA method sparsifies each input request to 6K tokens, and in the experiments, each request generates 4K tokens of output.
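
For readers who want to reproduce a comparable setup, below is a minimal sketch of serving DeepSeek-R1-Distill-Qwen-32B across two GPUs with vLLM's offline API. The Hugging Face model ID, sampling settings, and prompt are assumptions for illustration; the options that actually enable GSA sparsification (e.g., the 6K-token budget) are project-specific and not shown here.

```python
# Minimal sketch, assuming the standard vLLM offline API.
# GSA-specific enablement (e.g., the 6K-token sparsification budget) is
# configured by the project and is intentionally not shown here.
from vllm import LLM, SamplingParams

# Deploy the 32B model across two H20 GPUs via tensor parallelism.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # assumed HF model ID
    tensor_parallel_size=2,
)

# In the experiments, each request generates about 4K tokens of output.
sampling_params = SamplingParams(temperature=0.6, max_tokens=4096)

# A single placeholder prompt; the benchmarks use long-context requests
# that GSA sparsifies to roughly 6K input tokens.
outputs = llm.generate(["<long-context prompt>"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```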