Commit f10acdd

drop ascend scheduler (#4498)

The Ascend scheduler was originally added for the non-chunked-prefill case, because the NPU ops did not work well with chunked prefill at the time. Now that those ops work well with chunked prefill, the Ascend scheduler is removed and vLLM's default scheduler is used instead.

- vLLM version: v0.11.2

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
1 parent 53a52d6 commit f10acdd
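For deployments upgrading past this commit, the migration is simply dropping the `ascend_scheduler_config` key from `--additional-config`, as the tutorial diffs below show. A minimal before/after sketch (the model path is a placeholder):

```shell
# Before this commit (ascend_scheduler_config is no longer recognized):
vllm serve path/to/model \
  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'

# After this commit, vLLM's default scheduler (chunked prefill included) is used:
vllm serve path/to/model \
  --additional-config '{"torchair_graph_config":{"enabled":true}}'
```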

File tree

52 files changed: +85 / -2948 lines changed


.github/workflows/_e2e_test.yaml

Lines changed: 0 additions & 2 deletions
@@ -91,10 +91,8 @@ jobs:
       pytest -sv tests/e2e/singlecard/test_completion_with_prompt_embeds.py
       pytest -sv tests/e2e/singlecard/test_aclgraph.py
       pytest -sv tests/e2e/singlecard/test_aclgraph_mem.py
-      pytest -sv tests/e2e/singlecard/test_ascend_scheduler.py
       pytest -sv tests/e2e/singlecard/test_bge_model.py
       pytest -sv tests/e2e/singlecard/test_camem.py
-      pytest -sv tests/e2e/singlecard/test_chunked.py
       pytest -sv tests/e2e/singlecard/test_embedding.py
       # pytest -sv tests/e2e/singlecard/test_embedding_aclgraph.py
       pytest -sv tests/e2e/singlecard/test_guided_decoding.py

docs/source/tutorials/DeepSeek-V3.2-Exp.md

Lines changed: 5 additions & 5 deletions
@@ -108,7 +108,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.92 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
 ```
 
 ### Multi-node Deployment
@@ -160,7 +160,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.9 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
 ```
 
 **Node 1**
@@ -204,7 +204,7 @@ vllm serve /root/.cache/Modelers_Park/DeepSeek-V3.2-Exp \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.92 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
 ```
 
 ::::
@@ -252,7 +252,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
   --quantization ascend \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.9 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
 ```
 
 **Node 1**
@@ -299,7 +299,7 @@ vllm serve vllm-ascend/DeepSeek-V3.2-Exp-W8A8 \
   --quantization ascend \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.92 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true,"graph_batch_sizes":[16]}}'
 ```
 
 ::::

docs/source/tutorials/multi_node.md

Lines changed: 2 additions & 2 deletions
@@ -137,7 +137,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.9 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true}}'
 ```
 
 **Node 1**
@@ -182,7 +182,7 @@ vllm serve vllm-ascend/DeepSeek-V3.1-W8A8 \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.92 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true}}'
 ```
 
 The deployment view looks like:

docs/source/tutorials/multi_node_kimi.md

Lines changed: 2 additions & 2 deletions
@@ -93,7 +93,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.9 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true}}'
 ```
 
 **Node 1**
@@ -137,7 +137,7 @@ vllm serve /home/cache/weights/Kimi-K2-Instruct-W8A8 \
   --trust-remote-code \
   --no-enable-prefix-caching \
   --gpu-memory-utilization 0.92 \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
+  --additional-config '{"torchair_graph_config":{"enabled":true}}'
 ```
 
 The deployment view looks like:

docs/source/tutorials/multi_npu_moge.md

Lines changed: 0 additions & 5 deletions
@@ -158,11 +158,6 @@ if __name__ == "__main__":
         'torchair_graph_config': {
             'enabled': True,
         },
-        'ascend_scheduler_config':{
-            'enabled': True,
-            'enable_chunked_prefill' : False,
-            'chunked_prefill_enabled': False
-        },
     })
 
     outputs = llm.generate(prompts, sampling_params)

docs/source/user_guide/configuration/additional_config.md

Lines changed: 0 additions & 19 deletions
@@ -27,7 +27,6 @@ The following table lists additional configuration options available in vLLM Asc
 | Name | Type | Default | Description |
 |-------------------------------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------|
 | `torchair_graph_config` | dict | `{}` | Configuration options for torchair graph mode |
-| `ascend_scheduler_config` | dict | `{}` | Configuration options for ascend scheduler |
 | `weight_prefetch_config` | dict | `{}` | Configuration options for weight prefetch |
 | `refresh` | bool | `false` | Whether to refresh global Ascend configuration content. This is usually used by rlhf or ut/e2e test case. |
 | `expert_map_path` | str | `None` | When using expert load balancing for an MoE model, an expert map path needs to be passed in. |
@@ -61,18 +60,6 @@ The details of each configuration option are as follows:
 | `enable_kv_nz`| bool | `False` | Whether to enable KV Cache NZ layout. This option only takes effect on models using MLA (for example, DeepSeek). |
 | `enable_super_kernel` | bool | `False` | Whether to enable super kernel to fuse operators in deepseek moe layers. This option only takes effects on moe models using dynamic w8a8 quantization.|
 
-**ascend_scheduler_config**
-
-| Name | Type | Default | Description |
-| ---- | ---- | ------- | ----------- |
-| `enabled` | bool | `False` | Whether to enable ascend scheduler for V1 engine.|
-| `enable_pd_transfer` | bool | `False` | Whether to enable P-D transfer. When it is enabled, decode is started only when prefill of all requests is done. This option only takes effect on offline inference. |
-| `decode_max_num_seqs` | int | `0` | Whether to change max_num_seqs of decode phase when P-D transfer is enabled. This option only takes effect when enable_pd_transfer is True. |
-| `max_long_partial_prefills` | Union[int, float] | `float('inf')` | The maximum number of prompts longer than long_prefill_token_threshold that will be prefilled concurrently. |
-| `long_prefill_token_threshold` | Union[int, float] | `float('inf')` | a request is considered long if the prompt is longer than this number of tokens. |
-
-ascend_scheduler_config also supports the options from [vllm scheduler config](https://docs.vllm.ai/en/stable/api/vllm/config.html#vllm.config.SchedulerConfig). For example, you can add `enable_chunked_prefill: True` to ascend_scheduler_config as well.
-
 **weight_prefetch_config**
 
 | Name | Type | Default | Description |
@@ -93,12 +80,6 @@ An example of additional configuration is as follows:
         "graph_batch_sizes_init": False,
         "enable_kv_nz": False
     },
-    "ascend_scheduler_config": {
-        "enabled": True,
-        "enable_chunked_prefill": True,
-        "max_long_partial_prefills": 1,
-        "long_prefill_token_threshold": 4096,
-    },
     "weight_prefetch_config": {
         "enabled": True,
         "prefetch_ratio": {
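The removed `max_long_partial_prefills` and `long_prefill_token_threshold` options mirrored knobs in vLLM's own scheduler config (linked in the removed doc text above), so equivalent behavior should now come from vLLM's standard engine arguments instead. A hedged sketch, assuming the vLLM version in use (v0.11.2 here) exposes these flags and with placeholder values:

```shell
# Rely on vLLM's default scheduler; tune concurrent partial prefills directly.
vllm serve path/to/model \
  --max-num-partial-prefills 2 \
  --max-long-partial-prefills 1 \
  --long-prefill-token-threshold 4096
```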

docs/source/user_guide/feature_guide/graph_mode.md

Lines changed: 2 additions & 2 deletions
@@ -45,14 +45,14 @@ import os
 from vllm import LLM
 
 # TorchAirGraph only works without chunked-prefill now
-model = LLM(model="path/to/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True},"ascend_scheduler_config": {"enabled": True}})
+model = LLM(model="path/to/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True}})
 outputs = model.generate("Hello, how are you?")
 ```
 
 Online example:
 
 ```shell
-vllm serve path/to/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true},"ascend_scheduler_config": {"enabled": true}}'
+vllm serve path/to/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}}'
 ```
 
 You can find more details about additional configuration [here](../configuration/additional_config.md).

examples/offline_inference_npu_long_seq.py

Lines changed: 0 additions & 1 deletion
@@ -42,7 +42,6 @@
     enable_chunked_prefill=False,
     max_num_batched_tokens=2048,
     max_model_len=1024,
-    additional_config={"ascend_scheduler_config": {"enabled": False}},
     max_num_seqs=1,
     block_size=128,
     gpu_memory_utilization=0.9

examples/run_dp_server.sh

Lines changed: 1 addition & 1 deletion
@@ -28,4 +28,4 @@ vllm serve Qwen/Qwen1.5-MoE-A2.7B \
   --gpu-memory-utilization 0.9 \
   --trust-remote-code \
   --enforce-eager \
-  --additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":false, "use_cached_graph":false}}'
+  --additional-config '{"torchair_graph_config":{"enabled":false, "use_cached_graph":false}}'

tests/e2e/310p/test_offline_inference_parallel_310p.py

Lines changed: 1 addition & 4 deletions
@@ -24,15 +24,12 @@
 MODELS = [
     "IntervitensInc/pangu-pro-moe-model",
 ]
-# set additional config for ascend scheduler and torchair graph
+# set additional config for torchair graph
 ADDITIONAL_CONFIG = [{
     "additional_config": {
         "torchair_graph_config": {
             "enabled": True
         },
-        "ascend_scheduler_config": {
-            "enabled": True,
-        }
     }
 }]
