**ggerganov** commented Nov 8, 2025

Adding `-tgs` to `llama-batched-bench` makes it decode the sequences separately, one by one:

```
# no -tgs
0123 0123 0123 ...

# -tgs
0 0 0 ... 1 1 1 ... 2 2 2 ... 3 3 3 ...
```
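For reference, here is a minimal sketch of the two TG scheduling modes using the public `llama_batch` API. This is hypothetical illustration code, not the actual `llama-batched-bench` implementation; it assumes `ctx`, `n_seq`, `n_tg`, and per-sequence state `tok[]` / `p[]` are maintained elsewhere:

```cpp
// default: interleaved TG - one decode per step with one token per sequence,
// so cells land in the cache as 0123 0123 0123 ...
for (int t = 0; t < n_tg; ++t) {
    llama_batch batch = llama_batch_init(n_seq, 0, 1);
    batch.n_tokens = 0;
    for (llama_seq_id s = 0; s < n_seq; ++s) {
        const int i = batch.n_tokens++;
        batch.token   [i]    = tok[s]; // next token of sequence s
        batch.pos     [i]    = p[s]++;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = s;
        batch.logits  [i]    = true;
    }
    llama_decode(ctx, batch);
    llama_batch_free(batch);
}

// -tgs: separate TG - generate all tokens of sequence 0, then 1, then 2, ...
// so cells land as 0 0 0 ... 1 1 1 ... 2 2 2 ...
for (llama_seq_id s = 0; s < n_seq; ++s) {
    for (int t = 0; t < n_tg; ++t) {
        // same single-token batch setup as above, but for sequence s only,
        // with one llama_decode() call per generated token
    }
}
```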

This is useful for benchmarking the performance of the unified KV cache, where it is important to detect and skip masked regions in the KQ mask.
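In a unified cache all sequences share one KV buffer, so each query token may only attend to cells of its own sequence; every other entry of the KQ mask is -INF. A minimal sketch of that mask rule (a hypothetical helper, not the actual mask construction in llama.cpp):

```cpp
#include <cmath>   // INFINITY
#include <cstdint>

typedef int32_t llama_seq_id; // as in llama.h
typedef int32_t llama_pos;    // as in llama.h

// sketch: a query token of sequence s_q at position p_q may attend to a
// cache cell (s_c, p_c) only if the cell belongs to the same sequence
// and is not in the future; everything else is masked to -INF
float kq_mask_entry(llama_seq_id s_q, llama_pos p_q,
                    llama_seq_id s_c, llama_pos p_c) {
    if (s_c != s_q) return -INFINITY; // cell owned by another sequence
    if (p_c >  p_q) return -INFINITY; // causal mask
    return 0.0f;
}
```

With `-tgs`, each sequence's cells sit contiguously in the unified cache, so these -INF regions form long contiguous runs that the FA kernels can skip at block granularity.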

Example with the Metal backend:

```sh
# unified KV cache with up to 4 sequences, running one by one
llama-batched-bench -m ../models/gemma-3-4b-it/ggml-model-f16.gguf -c 33792 -npp 8192 -ntg 32 -npl 1,2,4 -kvu -tgs

# the cache looks like this
#
#                        prompt processing ends here v
# 000...[8192 tokens]...000111...111222...222333...333000...[32 tokens]...000111...111222...222333...333
#                              text generation starts ^
```

With the `-INF` block optimizations in the FA kernels:

```
main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.101 |  2641.88 |    0.478 |    66.95 |    3.579 |  2298.00 |
|  8192 |     32 |    2 |  16448 |    6.091 |  2689.76 |    0.971 |    65.90 |    7.062 |  2328.95 |
|  8192 |     32 |    4 |  32896 |   12.373 |  2648.43 |    1.965 |    65.15 |   14.337 |  2294.45 |
```

Disabling the `-INF` block optimizations in the FA kernels:

patch:

```diff
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index cea535ade..6c249fb56 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -4633,7 +4633,7 @@ kernel void kernel_flash_attn_ext_blk(
     const int32_t nblk0 = ((args.ne30 + C - 1)/C);
 
     if (tiisg == 0) {
-        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = res;
+        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = 1;
     }
 }
 
@@ -5660,7 +5660,7 @@ void kernel_flash_attn_ext_vec_impl(
             }
 
             // skip -INF blocks
-            if (simd_max(sm[tiisg]) == -INFINITY) {
+            if (false) {
                 continue;
             }
 
```
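The two hunks disable the optimization from both ends: the first makes the pre-pass kernel mark every block as non-masked (writing `1` instead of the computed flag `res`), and the second removes the per-block early-out in the vec kernel. A CPU-side analogue of the skip being disabled, written as a sketch rather than the actual Metal code:

```cpp
#include <algorithm>
#include <cmath>

// sketch: block-level -INF skip, assuming mask[] holds one KQ mask row of
// n_kv entries and C is the block size used by the kernel
void attend_row(const float * mask, int n_kv, int C) {
    for (int blk = 0; blk < n_kv/C; ++blk) {
        const float * m = mask + (size_t) blk*C;

        float mmax = -INFINITY;
        for (int j = 0; j < C; ++j) {
            mmax = std::max(mmax, m[j]);
        }
        if (mmax == -INFINITY) {
            continue; // fully masked block contributes nothing to the
                      // softmax, so all FA work for it can be skipped
        }

        // ... compute QK^T for this block, apply the mask, update softmax ...
    }
}
```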
```
main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.528 |  2321.91 |    0.500 |    64.05 |    4.028 |  2041.84 |
|  8192 |     32 |    2 |  16448 |    7.393 |  2216.28 |    1.027 |    62.30 |    8.420 |  1953.47 |
|  8192 |     32 |    4 |  32896 |   16.157 |  2028.06 |    2.159 |    59.30 |   18.316 |  1796.04 |
```

Observe that both PP and TG performance are worse, and the regression is amplified with more sequences in the cache: at B=4, PP drops from 2648.43 to 2028.06 t/s (about 23%) and TG from 65.15 to 59.30 t/s (about 9%), versus roughly 12% and 4% at B=1.

**ggerganov** merged commit f914544 into `master` on Nov 10, 2025 (71 checks passed) and deleted the `gg/batched-bench-separate-tg` branch.