Draft
Changes from all commits
53 commits
c3af2af
Split PR. Second part. Compile ranges
ilmarkov Sep 4, 2025
0cbb065
Remove general shape graph
ilmarkov Sep 4, 2025
d5392f5
Add test to test pipeline
ilmarkov Sep 5, 2025
027c9eb
Fix pre-commit
ilmarkov Sep 9, 2025
b2992d3
Upd
ilmarkov Oct 16, 2025
3499384
Upd config
ilmarkov Oct 16, 2025
5336ee6
Fix
ilmarkov Oct 16, 2025
4958474
Priotitize compile_sizes
ilmarkov Oct 17, 2025
04306ed
Fix inductor config
ilmarkov Oct 28, 2025
9dc4eea
Laith's fix
ilmarkov Nov 3, 2025
2c63f0b
Upd
ilmarkov Nov 4, 2025
8b8d01d
Merge branch 'imarkov/fused_allreduce_torch_native' into imarkov/cond…
ilmarkov Nov 4, 2025
fcebc21
Add caching
ilmarkov Nov 4, 2025
65151bc
Address comments
ilmarkov Nov 5, 2025
df22202
Update benchmark
ilmarkov Nov 5, 2025
a21de2b
Fix
ilmarkov Nov 5, 2025
ada24e6
Merge branch 'imarkov/fused_allreduce_torch_native' into imarkov/cond…
ilmarkov Nov 6, 2025
6766e4f
Update fakify for compile sizes
ilmarkov Nov 5, 2025
af87d7a
Linter fix
ilmarkov Nov 6, 2025
459f71c
Merge branch 'imarkov/fused_allreduce_torch_native' into imarkov/cond…
ilmarkov Nov 6, 2025
b4c1b1d
Address the review
ilmarkov Nov 10, 2025
f080a83
[RFC][ROCm][AITER] Keep all AITER kernels in `_aiter_ops` class like …
vllmellm Nov 10, 2025
d0e186c
[V0 Deprecation] Remove unused `context_len` and `seq_len` from M-RoP…
DarkLight1337 Nov 10, 2025
a3e7bdc
Merge branch 'imarkov/fused_allreduce_torch_native' into imarkov/cond…
ilmarkov Nov 10, 2025
b039bfd
[Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366)
varun-sundar-rabindranath Nov 10, 2025
34553b9
[Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3ne…
jiahanc Nov 10, 2025
6d54336
[Bugfix] Fix llguidance backend, rollback when EOS was encountered (#…
Flechman Nov 10, 2025
9c84ca8
[FA/Chore] Bump FA version for FP8 two-level accumulation (#27889)
jmkuebler Nov 10, 2025
40d3326
[Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (#…
SageMoore Nov 10, 2025
bf6a3d0
[Misc] Add more scoping for improved trace (#28329)
frank-wei Nov 10, 2025
6dec9f6
[BugFix] Fix DeepGEMM over-allocating workspace (#28254)
LucasWilkinson Nov 10, 2025
4b94ed8
[Frontend][2/n] remove empty content from _parse_tool_calls_from_cont…
qandrew Nov 10, 2025
30700b1
[CI] Fix Plugin Tests Tests (#28413)
robertgshaw2-redhat Nov 10, 2025
0211435
[ROCm] Add missing gemm_a8w8_blockscale import (#28378)
sarckk Nov 10, 2025
d17ecc6
[PERF] Allreduce fusion. Support torch native matching. Tuning of the…
ilmarkov Nov 10, 2025
b30372c
[Perf] Move gc.freeze logic from EngineCoreProc to EngineCore for bet…
Jialin Nov 10, 2025
a5a790e
[Bugfix] Ensure calculated KV scales are applied in attention. (#27232)
adabeyta Nov 10, 2025
0bf29fa
[Test] Remove old non-varlen FA2 test (#28420)
MatthewBonanni Nov 10, 2025
35d801f
[Feature] Refactor batch invariant fp8 DeepGEMM (#27606)
yewentao256 Nov 11, 2025
39029d5
[CI/Test Fix] Fix CP tests on Blackwell (#28404)
LucasWilkinson Nov 11, 2025
de540c0
[Feature] Add env var `VLLM_MOE_USE_DEEP_GEMM` (#28422)
yewentao256 Nov 11, 2025
f2d9ad0
Only register rocm_aiter_ops if aiter is found (#28428)
mgoin Nov 11, 2025
57201a6
Fix rotary embedding benchmark script (#28323)
xyang16 Nov 11, 2025
8d706cc
[Misc] FlattenLogprobs -> FlatLogprobs (#28335)
zhuohan123 Nov 11, 2025
bca74e3
[Frontend] Add sagemaker_standards dynamic lora adapter and stateful …
zhaozuy Nov 11, 2025
e605e8e
[Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430)
robertgshaw2-redhat Nov 11, 2025
a7adbc6
[Doc] Sleep mode documentation (#28357)
iAmir97 Nov 11, 2025
cc07976
[BugFix] Avoid calling KV connector layer APIs when metadata is unset…
sdavidbd Nov 11, 2025
4fd4b74
[Bugfix] Fix max image size for PaddleOCR-VL (#28442)
ywang96 Nov 11, 2025
798c7be
[EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU tra…
SageMoore Nov 11, 2025
f0359ff
[Bugfix] fix qwen3-next crash (#28202)
ZJY0516 Nov 11, 2025
c799126
[BugFix] 'DeepseekV2Config' object has no attribute 'use_mla'` (#28387)
faaany Nov 11, 2025
a810969
Merge branch 'main' into imarkov/conditional_compilation_ranges
ilmarkov Nov 11, 2025
8 changes: 6 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -450,6 +450,7 @@ steps:
- pytest -v -s compile/test_decorator.py
- pytest -v -s compile/test_noop_elimination.py
- pytest -v -s compile/test_aot_compile.py
- pytest -v -s compile/test_compile_ranges.py

- label: PyTorch Fullgraph Smoke Test # 15min
timeout_in_minutes: 30
@@ -471,8 +472,8 @@ steps:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph.py
# Limit to no custom ops to reduce running time
- pytest -v -s compile/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
# Limit to no custom ops to reduce running time
# Wrap with quotes to escape yaml and avoid starting -k string with a -
- "pytest -v -s compile/test_fusions_e2e.py -k 'TRITON and -quant_fp8'"

@@ -951,10 +952,13 @@ steps:
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/test_fusions_e2e.py
- tests/compile/test_full_graph.py
commands:
- nvidia-smi
# Run all e2e fusion tests
- pytest -v -s tests/compile/test_fusions_e2e.py
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
- pytest -v -s tests/compile/test_full_graph.py::test_fp8_kv_scale_compile

- label: Blackwell GPT-OSS Eval
timeout_in_minutes: 60
154 changes: 64 additions & 90 deletions benchmarks/kernels/benchmark_rope.py
@@ -1,97 +1,76 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from itertools import accumulate
import itertools

import nvtx
import torch

from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding, get_rope
from vllm.platforms import current_platform
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser

batch_size_range = [2**i for i in range(0, 8, 2)]
seq_len_range = [2**i for i in range(6, 10, 1)]
num_heads_range = [32, 48]
configs = list(itertools.product(batch_size_range, seq_len_range, num_heads_range))

def benchmark_rope_kernels_multi_lora(
is_neox_style: bool,
batch_size: int,
seq_len: int,
num_heads: int,
head_size: int,
rotary_dim: int | None,
dtype: torch.dtype,
seed: int,
device: str,
max_position: int = 8192,
base: float = 10000,
) -> None:
current_platform.seed_everything(seed)
torch.set_default_device(device)
if rotary_dim is None:
rotary_dim = head_size
# simulating serving 4 LoRAs
scaling_factors = [1, 2, 4, 8]
# batched RoPE can take multiple scaling factors
batched_rope = get_rope(
head_size,
rotary_dim,
max_position,
base,
is_neox_style,
{"rope_type": "linear", "factor": tuple(scaling_factors)},

def get_benchmark(head_size, rotary_dim, is_neox_style, device):
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["batch_size", "seq_len", "num_heads"],
x_vals=[list(_) for _ in configs],
line_arg="provider",
line_vals=["torch", "flashinfer", "vllm"],
line_names=["PyTorch", "FlashInfer", "vLLM"],
styles=[("blue", "-"), ("green", "-"), ("red", "-")],
ylabel="us",
plot_name=f"rope-perf{'-neox-style' if is_neox_style else ''}",
args={},
)
)
# non-batched RoPE takes only one scaling factor, we create multiple
# instances to simulate the same behavior
non_batched_ropes: list[RotaryEmbedding] = []
for scaling_factor in scaling_factors:
non_batched_ropes.append(
get_rope(
head_size,
rotary_dim,
max_position,
base,
is_neox_style,
{"rope_type": "linear", "factor": (scaling_factor,)},
)
def benchmark(batch_size, seq_len, num_heads, provider):
dtype = torch.bfloat16
max_position = 8192
base = 10000
rope = get_rope(head_size, rotary_dim, max_position, base, is_neox_style)
rope = rope.to(dtype=dtype, device=device)
cos_sin_cache = rope.cos_sin_cache.to(dtype=torch.float, device=device)

positions = torch.randint(0, max_position, (batch_size, seq_len), device=device)
query = torch.randn(
(batch_size, seq_len, num_heads * head_size), dtype=dtype, device=device
)
key = torch.randn_like(query)

positions = torch.randint(0, max_position, (batch_size, seq_len))
query = torch.randn(batch_size, seq_len, num_heads * head_size, dtype=dtype)
key = torch.randn_like(query)
quantiles = [0.5, 0.2, 0.8]

# create query offsets for batched RoPE, we concat multiple kv cache
# together and each query needs to find the right kv cache of its type
offset_map = torch.tensor(
list(
accumulate(
[0]
+ [
max_position * scaling_factor * 2
for scaling_factor in scaling_factors[:-1]
]
if provider == "torch":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: rope.forward_native(positions, query.clone(), key.clone()),
quantiles=quantiles,
)
)
)
query_types = torch.randint(
0, len(scaling_factors), (batch_size, seq_len), device=device
)
# map query types to offsets
query_offsets = offset_map[query_types]
# the kernel takes flattened offsets
flatten_offsets = query_offsets.flatten()
elif provider == "flashinfer":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: torch.ops.vllm.flashinfer_rotary_embedding(
positions,
query.clone(),
key.clone(),
head_size,
cos_sin_cache,
is_neox_style,
),
quantiles=quantiles,
)
else:
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: rope.forward_cuda(positions, query.clone(), key.clone()),
quantiles=quantiles,
)

return 1000 * ms, 1000 * max_ms, 1000 * min_ms

# batched queries of the same type together for non-batched RoPE
queries = [query[query_types == i] for i in range(len(scaling_factors))]
keys = [key[query_types == i] for i in range(len(scaling_factors))]
packed_qkr = zip(queries, keys, non_batched_ropes)
# synchronize before start timing
torch.cuda.synchronize()
with nvtx.annotate("non-batched", color="yellow"):
for q, k, r in packed_qkr:
r.forward(positions, q, k)
torch.cuda.synchronize()
with nvtx.annotate("batched", color="green"):
batched_rope.forward(positions, query, key, flatten_offsets)
torch.cuda.synchronize()
return benchmark


if __name__ == "__main__":
@@ -116,17 +95,12 @@ def benchmark_rope_kernels_multi_lora(
parser.add_argument(
"--device", type=str, choices=["cuda:0", "cuda:1"], default="cuda:0"
)
parser.add_argument("--save-path", type=str, default="./configs/rope/")
args = parser.parse_args()
print(args)

benchmark_rope_kernels_multi_lora(
is_neox_style=args.is_neox_style,
batch_size=args.batch_size,
seq_len=args.seq_len,
num_heads=args.num_heads,
head_size=args.head_size,
rotary_dim=args.rotary_dim,
dtype=getattr(torch, args.dtype),
seed=args.seed,
device=args.device,
# Get the benchmark function
benchmark = get_benchmark(
args.head_size, args.rotary_dim, args.is_neox_style, args.device
)
# Run performance benchmark
benchmark.run(print_data=True, save_path=args.save_path)
2 changes: 1 addition & 1 deletion cmake/external_projects/vllm_flash_attn.cmake
@@ -38,7 +38,7 @@ else()
FetchContent_Declare(
vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG a893712401d70362fbb299cd9c4b3476e8e9ed54
GIT_TAG 8e1b01d56210dc72030a2d0d41c2d8d266ba6309
GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
15 changes: 10 additions & 5 deletions csrc/quantization/activation_kernels.cu
@@ -578,11 +578,13 @@ void persistent_masked_m_silu_mul_quant(

// This kernel currently only supports H % 128 == 0 and assumes a
// fixed GROUP_SIZE of 128.
static constexpr int GROUP_SIZE = 128;

TORCH_CHECK(input.dtype() == torch::kBFloat16);
TORCH_CHECK(y_q.dtype() == torch::kFloat8_e4m3fn ||
y_q.dtype() == torch::kFloat8_e4m3fnuz);
TORCH_CHECK(y_s.dtype() == torch::kFloat32);
TORCH_CHECK(input.size(-1) % 256 == 0);
TORCH_CHECK(input.size(-1) % (GROUP_SIZE * 2) == 0);

using Idx_t = int64_t;

@@ -601,8 +603,6 @@ void persistent_masked_m_silu_mul_quant(

Idx_t stride_counts_e = tokens_per_expert.stride(0);

static constexpr int GROUP_SIZE = 128;

const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

#define KERNEL(BLOCK_COUNT, USE_UE8M0, THREAD_COUNT, STAGES) \
@@ -628,21 +628,26 @@

static constexpr int SILU_V2_BLOCK_COUNT = 132 * 32;

int const NUM_GROUPS = H / GROUP_SIZE;
if (!use_ue8m0) {
if (H >= 4096) {
if (H >= 4096 && (NUM_GROUPS % 8 == 0)) {
/* 8 warps config */
static constexpr int NUM_STAGES = 4;
static constexpr int THREAD_COUNT = 256;
KERNEL(SILU_V2_BLOCK_COUNT, false, THREAD_COUNT, NUM_STAGES);
} else {
/* 1 warp config */
static constexpr int THREAD_COUNT = 32;
KERNEL(SILU_V2_BLOCK_COUNT, false, THREAD_COUNT, 2);
}
} else {
if (H >= 4096) {
if (H >= 4096 && (NUM_GROUPS % 8 == 0)) {
/* 8 warps config */
static constexpr int NUM_STAGES = 4;
static constexpr int THREAD_COUNT = 256;
KERNEL(SILU_V2_BLOCK_COUNT, true, THREAD_COUNT, NUM_STAGES);
} else {
/* 1 warp config */
static constexpr int THREAD_COUNT = 32;
KERNEL(SILU_V2_BLOCK_COUNT, true, THREAD_COUNT, 2);
}
2 changes: 1 addition & 1 deletion docs/design/moe_kernel_features.md
@@ -97,7 +97,7 @@ To be used with a particular `FusedMoEPrepareAndFinalize` sub-class, MoE kernels
| trtllm | standard | mxfp4,</br>nvfp4 | G(16),G(32) | <sup>5</sup> | N | Y | [`TrtLlmGenExperts`][vllm.model_executor.layers.fused_moe.trtllm_moe.TrtLlmGenExperts] |
| pallas | standard | N/A | N/A | silu | N | N | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_pallas.fused_moe] |
| iterative | standard | N/A | N/A | silu | N | N | [`fused_moe`][vllm.model_executor.layers.fused_moe.moe_torch_iterative.fused_moe] |
| rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | [`rocm_aiter_fused_experts`][vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe.rocm_aiter_fused_moe_impl] |
| rocm aiter moe | standard | fp8 | G(128),A,T | silu, gelu | Y | N | [`rocm_aiter_fused_experts`][vllm.model_executor.layers.fused_moe.rocm_aiter_fused_moe.rocm_aiter_fused_experts] |
| cpu_fused_moe | standard | N/A | N/A | silu | N | N | [`CPUFusedMOE`][vllm.model_executor.layers.fused_moe.cpu_fused_moe.CPUFusedMOE] |
| naive batched<sup>4</sup> | batched | int8,</br>fp8 | G,A,T | silu, gelu | <sup>6</sup> | Y | [`NaiveBatchedExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.NaiveBatchedExperts] |

39 changes: 39 additions & 0 deletions docs/features/sleep_mode.md
@@ -13,6 +13,9 @@ Key benefits:
!!! note
This feature is only supported on CUDA platform.

!!! note
For more information, see this [Blog Post](https://blog.vllm.ai/2025/10/26/sleep-mode.html).

## Sleep levels

Level 1 sleep offloads the model weights to CPU memory and discards the KV cache; the KV cache contents are forgotten. It is suited to putting the engine to sleep and waking it up later to run the same model again. Because the weights are backed up in CPU memory, make sure there is enough CPU memory to hold them. Level 2 sleep discards both the model weights and the KV cache (only the model's buffers, such as rope scaling tensors, are kept in CPU memory); the contents of both are forgotten. It is suited to waking the engine up to run a different model or to update the current one, where the previous weights are no longer needed, e.g. an RLHF weight update.
@@ -31,13 +34,29 @@ llm = LLM("Qwen/Qwen3-0.6B", enable_sleep_mode=True)
#### Python API

```python
# Sleep level 1
# Put the engine to sleep (level=1: offload weights to CPU RAM, discard KV cache)
llm.sleep(level=1)

# Wake up the engine (restore weights)
llm.wake_up()
```

```python
# Sleep level 2
# Put the engine to sleep (level=2: discard both weights and KV cache)
llm.sleep(level=2)

# Reallocate weights memory only
llm.wake_up(tags=["weights"])

# Load weights in-place
llm.collective_rpc("reload_weights")

# Reallocate KV cache
llm.wake_up(tags=["kv_cache"])
```

#### RLHF weight updates

During RLHF training, vLLM allows you to selectively wake up only the model weights or the KV cache using the tags argument in wake_up(). This fine-grained control is especially useful when updating model weights: by waking up just the weights (e.g., llm.wake_up(tags=["weights"])), you avoid allocating memory for the KV cache until after the weight update is complete. This approach helps prevent GPU out-of-memory (OOM) errors, particularly with large models, by minimizing peak memory usage during weight synchronization and update operations.
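
A minimal sketch of this flow, for illustration only (it is not part of the diff and assumes just the `LLM` calls shown in these docs):

```python
from vllm import LLM

# Engine created with sleep mode enabled, as in the example above
llm = LLM("Qwen/Qwen3-0.6B", enable_sleep_mode=True)

# Free GPU memory before the RLHF weight update:
# level=2 discards both the weights and the KV cache
llm.sleep(level=2)

# (the training side produces updated weights here)

# Wake up only the weight buffers, so the KV cache does not compete
# for GPU memory during the update
llm.wake_up(tags=["weights"])

# Reload the updated weights in-place
llm.collective_rpc("reload_weights")

# Only now reallocate the KV cache and resume serving
llm.wake_up(tags=["kv_cache"])
```
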
@@ -69,10 +88,30 @@ VLLM_SERVER_DEV_MODE=1 vllm serve Qwen/Qwen3-0.6B \
--port 8000
```

Below is an example of how to put a model to sleep and wake it up at level 1.

```bash
curl -X POST 'http://localhost:8000/sleep?level=1'
curl -X POST 'http://localhost:8000/wake_up'
```

And this is an example of how to put a model to sleep and wake it up at level 2.

```bash
curl -X POST 'http://localhost:8000/sleep?level=2'
# Reallocate weights memory only
curl -X POST 'http://localhost:8000/wake_up?tags=weights'
# Load weights in-place
curl -X POST 'http://localhost:8000/collective_rpc' -H 'Content-Type: application/json' -d '{"method":"reload_weights"}'
# Reallocate KV cache
curl -X POST 'http://localhost:8000/wake_up?tags=kv_cache'
```

#### HTTP endpoints

- `POST /sleep?level=1` — Put the model to sleep (`level=1`).
- `POST /wake_up` — Wake up the model. Supports optional `tags` query parameters for partial wake-up (e.g., `?tags=weights`).
- `POST /collective_rpc` — Perform a collective remote procedure call (RPC).
- `GET /is_sleeping` — Check if the model is sleeping.
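
The same level-2 cycle can also be driven programmatically; below is a minimal sketch using `requests` against the endpoints above (the base URL and response handling are assumptions, not part of the documented API):

```python
import requests

BASE = "http://localhost:8000"  # assumed server address, matching the example above

# Put the engine to sleep at level 2 (discard weights and KV cache)
requests.post(f"{BASE}/sleep", params={"level": 2}).raise_for_status()
print(requests.get(f"{BASE}/is_sleeping").text)  # response format not shown above

# Reallocate weights memory only, then reload the weights in-place
requests.post(f"{BASE}/wake_up", params={"tags": "weights"}).raise_for_status()
requests.post(
    f"{BASE}/collective_rpc", json={"method": "reload_weights"}
).raise_for_status()

# Reallocate the KV cache to finish waking up
requests.post(f"{BASE}/wake_up", params={"tags": "kv_cache"}).raise_for_status()
```
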

!!! note
1 change: 1 addition & 0 deletions requirements/common.txt
@@ -49,3 +49,4 @@ cbor2 # Required for cross-language serialization of hashable objects
setproctitle # Used to set process names for better debugging and monitoring
openai-harmony >= 0.0.3 # Required for gpt-oss
anthropic == 0.71.0
model-hosting-container-standards < 1.0.0