From 35f52a6c1fd5fcd3379335b4b414cb970f672f17 Mon Sep 17 00:00:00 2001
From: ming1212 <2717180080@qq.com>
Date: Mon, 1 Dec 2025 15:56:47 +0800
Subject: [PATCH 1/5] Add Qwen3-Next tutorials

Signed-off-by: ming1212 <2717180080@qq.com>
---
 ...{multi_npu_qwen3_next.md => Qwen3-Next.md} | 86 ++++++++++++++-----
 docs/source/tutorials/index.md                |  2 +-
 2 files changed, 64 insertions(+), 24 deletions(-)
 rename docs/source/tutorials/{multi_npu_qwen3_next.md => Qwen3-Next.md} (55%)

diff --git a/docs/source/tutorials/multi_npu_qwen3_next.md b/docs/source/tutorials/Qwen3-Next.md
similarity index 55%
rename from docs/source/tutorials/multi_npu_qwen3_next.md
rename to docs/source/tutorials/Qwen3-Next.md
index eeb57f5a6fa..ac4128844b6 100644
--- a/docs/source/tutorials/multi_npu_qwen3_next.md
+++ b/docs/source/tutorials/Qwen3-Next.md
@@ -1,15 +1,27 @@
-# Multi-NPU (Qwen3-Next)
+# Qwen3-Next

-```{note}
-The Qwen3 Next is using [Triton Ascend](https://gitee.com/ascend/triton-ascend) which is currently experimental. In future versions, there may be behavioral changes related to stability, accuracy, and performance improvement.
-```
+## Introduction
+
+Qwen3-Next is a highly sparse MoE (Mixture of Experts) model. Compared with the Qwen3 MoE architecture, it introduces key improvements such as a hybrid attention mechanism and a multi-token prediction mechanism, improving training and inference efficiency for long contexts and large total parameter scales.
+
+This document presents the core verification steps for the model, covering supported features, environment preparation, and accuracy and performance evaluation. Qwen3-Next currently relies on Triton Ascend, which is experimental: behavior related to stability and accuracy may change in later versions, and performance will be continuously optimized.
+
+The `Qwen3-Next` model is first supported in `vllm-ascend:v0.10.2rc1`.
+
+## Supported Features
+
+Refer to [supported features](../user_guide/support_matrix/supported_models.md) to get the model's supported feature matrix.
+
+Refer to [feature guide](../user_guide/feature_guide/index.md) to get the feature's configuration.
+
+## Weight Preparation
 
-## Run vllm-ascend on Multi-NPU with Qwen3 Next
+Download link for the `Qwen3-Next-80B-A3B-Instruct` model weights: [Download model weight](https://modelers.cn/models/Modelers_Park/Qwen3-Next-80B-A3B-Instruct/tree/main)
 
-Run docker container:
+## Deployment
 
+### Run docker container
 ```{code-block} bash
-  :substitutions:
 # Update the vllm-ascend image
 export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
 docker run --rm \
@@ -32,12 +44,7 @@ docker run --rm \
 -it $IMAGE bash
 ```
 
-Set up environment variables:
-
-```bash
-# Load model from ModelScope to speed up download
-export VLLM_USE_MODELSCOPE=True
-```
+Qwen3-Next uses [Triton Ascend](https://gitee.com/ascend/triton-ascend), which is currently experimental. Future versions may bring behavioral changes related to stability, accuracy, and performance.
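
Once the Triton Ascend wheel from the install steps below is in place, a quick import check can confirm the Python environment sees it. This is a minimal sketch, not part of the official install flow; it only verifies that the `triton` package is importable and prints whatever version string the wheel reports:

```python
import importlib.util

# Probe for the `triton` package installed by the Triton Ascend wheel.
spec = importlib.util.find_spec("triton")
if spec is None:
    print("triton is not installed; follow the install steps below")
else:
    import triton

    # The exact version string depends on the wheel (e.g. a 3.2.0.dev build).
    print("triton found:", getattr(triton, "__version__", "unknown version"))
```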
### Install Triton Ascend @@ -49,14 +56,17 @@ The [Triton Ascend](https://gitee.com/ascend/triton-ascend) is required when you Install the Ascend BiSheng toolkit: ```bash -source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run +chmod a+x Ascend-BiSheng-toolkit_aarch64.run +./Ascend-BiSheng-toolkit_aarch64.run --install +source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh ``` Install Triton Ascend: ```bash -wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl -pip install triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl +pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl ``` :::: @@ -68,13 +78,7 @@ Coming soon ... :::: ::::: -### Inference on Multi-NPU - -Please make sure you have already executed the command: - -```bash -source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh -``` +### Inference :::::{tab-set} ::::{tab-item} Online Inference @@ -152,3 +156,39 @@ Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I :::: ::::: + + +## Accuracy Evaluation + + +### Using AISBench + + Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details. + + + +## Performance + +### Using AISBench + +Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details. + +### Using vLLM Benchmark + +Run performance evaluation of `Qwen3-Next` as an example. + +Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details. 
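
Because `--save-result --result-dir ./` (as used in the `serve` example in this section) writes the metrics to a JSON file in the current directory, the summary can also be read back programmatically. The sketch below is an assumption-laden example: the field names (`request_throughput`, `mean_ttft_ms`, `mean_tpot_ms`) follow vLLM's serving-benchmark output and may differ across versions:

```python
import glob
import json

# `vllm bench serve --save-result --result-dir ./` writes a JSON summary file.
# The field names used below are assumptions based on vLLM's serving-benchmark
# output and may change between vLLM versions.
results = []
for path in sorted(glob.glob("./*.json")):
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        continue  # skip unreadable or non-JSON files
    if isinstance(data, dict) and "request_throughput" in data:
        results.append((path, data))

for path, data in results:
    print(path)
    for key in ("request_throughput", "mean_ttft_ms", "mean_tpot_ms"):
        print(f"  {key}: {data.get(key)}")
```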
+
+There are three `vllm bench` subcommands:
+- `latency`: Benchmark the latency of a single batch of requests.
+- `serve`: Benchmark the online serving throughput.
+- `throughput`: Benchmark offline inference throughput.
+
+Take `serve` as an example and run the command as follows.
+
+```shell
+export VLLM_USE_MODELSCOPE=true
+vllm bench serve --model Qwen/Qwen3-Next-80B-A3B-Instruct --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
+```
+
+After a few minutes, you can get the performance evaluation results.
diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index 71fa2815ddc..ecacbc94ec7 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -10,7 +10,7 @@ single_npu_qwen3_embedding
 single_npu_qwen3_quantization
 single_npu_qwen3_w4a4
 single_node_pd_disaggregation_llmdatadist
-multi_npu_qwen3_next
+Qwen3-Next
 multi_npu
 multi_npu_moge
 multi_npu_qwen3_moe

From c12e96dfee13330cdb3b95ae427330633af0eda5 Mon Sep 17 00:00:00 2001
From: ming1212 <2717180080@qq.com>
Date: Mon, 1 Dec 2025 17:38:51 +0800
Subject: [PATCH 2/5] update

Signed-off-by: ming1212 <2717180080@qq.com>
---
 docs/source/tutorials/Qwen3-Next.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md
index ac4128844b6..d35978dc265 100644
--- a/docs/source/tutorials/Qwen3-Next.md
+++ b/docs/source/tutorials/Qwen3-Next.md
@@ -157,16 +157,12 @@ Prompt: 'Who are you?', Generated text: ' What do you know about me?\n\nHello! I
 ::::
 :::::
-
 ## Accuracy Evaluation
-
 ### Using AISBench
 
 Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.
- - ## Performance ### Using AISBench From 92539167f2b0b1cb461a8bc59e6bd79bb116d3f5 Mon Sep 17 00:00:00 2001 From: ming1212 <2717180080@qq.com> Date: Thu, 4 Dec 2025 18:25:33 +0800 Subject: [PATCH 3/5] Update triton package name Signed-off-by: ming1212 <2717180080@qq.com> --- docs/source/tutorials/Qwen3-Next.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md index d35978dc265..6ba598a2e5f 100644 --- a/docs/source/tutorials/Qwen3-Next.md +++ b/docs/source/tutorials/Qwen3-Next.md @@ -56,17 +56,14 @@ The [Triton Ascend](https://gitee.com/ascend/triton-ascend) is required when you Install the Ascend BiSheng toolkit: ```bash -wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/Ascend-BiSheng-toolkit_aarch64.run -chmod a+x Ascend-BiSheng-toolkit_aarch64.run -./Ascend-BiSheng-toolkit_aarch64.run --install -source /usr/local/Ascend/8.3.RC1/bisheng_toolkit/set_env.sh +source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh ``` Install Triton Ascend: ```bash -wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl -pip install triton_ascend-3.2.0.dev20250914-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl +wget https://vllm-ascend.obs.cn-north-4.myhuaweicloud.com/vllm-ascend/triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl +pip install triton_ascend-3.2.0.dev2025110717-cp311-cp311-manylinux_2_27_aarch64.whl ``` :::: From 9a74cde7642b19c46930cb66e9bd594417796fa5 Mon Sep 17 00:00:00 2001 From: ming1212 <2717180080@qq.com> Date: Thu, 4 Dec 2025 18:36:26 +0800 Subject: [PATCH 4/5] update Signed-off-by: ming1212 <2717180080@qq.com> --- docs/source/tutorials/Qwen3-Next.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md 
index 6ba598a2e5f..4c34edceea7 100644 --- a/docs/source/tutorials/Qwen3-Next.md +++ b/docs/source/tutorials/Qwen3-Next.md @@ -77,6 +77,12 @@ Coming soon ... ### Inference +Please make sure you have already executed the command: + +```bash +source /usr/local/Ascend/ascend-toolkit/8.3.RC2/bisheng_toolkit/set_env.sh +``` + :::::{tab-set} ::::{tab-item} Online Inference From ea7a96e6e98241f3e8bdc20200469af4af118513 Mon Sep 17 00:00:00 2001 From: ming1212 <2717180080@qq.com> Date: Fri, 5 Dec 2025 19:50:54 +0800 Subject: [PATCH 5/5] update Signed-off-by: ming1212 <2717180080@qq.com> --- docs/source/tutorials/Qwen3-Next.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/tutorials/Qwen3-Next.md b/docs/source/tutorials/Qwen3-Next.md index 4c34edceea7..97f2b25fda5 100644 --- a/docs/source/tutorials/Qwen3-Next.md +++ b/docs/source/tutorials/Qwen3-Next.md @@ -22,6 +22,7 @@ Refer to [feature guide](../user_guide/feature_guide/index.md) to get the featur ### Run docker container ```{code-block} bash + :substitutions: # Update the vllm-ascend image export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version| docker run --rm \