Commit 5f2cacd

Quick fix for IMA with the Prefix Prefill kernel during graph capture (vllm-project#25983)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
1 parent aa5053e commit 5f2cacd

File tree

1 file changed (+8, -0 lines)

vllm/v1/attention/backends/rocm_attn.py

Lines changed: 8 additions & 0 deletions
@@ -83,6 +83,14 @@ def build_for_cudagraph_capture(
         # max_model_len will cause graph capture to be extremely
         # slow, so here we set it to 1.
         attn_metadata.seq_lens.fill_(1)
+
+        if envs.VLLM_V1_USE_PREFILL_DECODE_ATTENTION:
+            # Here we set the query start locs to 0. This is to
+            # cover up an invalid memory access in the prefix_prefill kernel
+            # that we run into during graph capture (#25985)
+            common_attn_metadata.query_start_loc.zero_()
+            common_attn_metadata.query_start_loc_cpu.zero_()
+
         return attn_metadata

     def build(self,
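
For reference, a minimal standalone sketch of the workaround pattern shown in the diff, with simplified, hypothetical names and a flat function signature (the real builder lives in vllm/v1/attention/backends/rocm_attn.py and reads the flag from vllm.envs rather than os.environ):

import os
import torch

def build_for_cudagraph_capture_sketch(seq_lens: torch.Tensor,
                                        query_start_loc: torch.Tensor,
                                        query_start_loc_cpu: torch.Tensor) -> None:
    # Capturing graphs with seq_lens at max_model_len is extremely slow,
    # so dummy sequence lengths of 1 are used for the capture batch.
    seq_lens.fill_(1)

    # Workaround for the invalid memory access: when the prefill/decode
    # attention path is enabled, zero the query start locations so the
    # prefix_prefill kernel never indexes beyond the dummy capture batch.
    if os.environ.get("VLLM_V1_USE_PREFILL_DECODE_ATTENTION", "0") == "1":
        query_start_loc.zero_()
        query_start_loc_cpu.zero_()

# Example: dummy metadata tensors for a capture batch of 8 requests.
seq_lens = torch.ones(8, dtype=torch.int32)
query_start_loc = torch.arange(9, dtype=torch.int32)
query_start_loc_cpu = query_start_loc.clone()
build_for_cudagraph_capture_sketch(seq_lens, query_start_loc, query_start_loc_cpu)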
