[Bug]: DSR1 FP4 + DEP8 on B200 fails with TensorRT-LLM throughput kernels

### Your current environment

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.1.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.8.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-57-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration :
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200
GPU 4: NVIDIA B200
GPU 5: NVIDIA B200
GPU 6: NVIDIA B200
GPU 7: NVIDIA B200

Nvidia driver version        : 580.65.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64


Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.3.1
[pip3] numpy==2.2.6 
[pip3] nvidia-cublas-cu12==12.8.4.1 
[pip3] nvidia-cuda-cupti-cu12==12.8.90 
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90 
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.14.1
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-ml-py==13.580.82
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cu128
[pip3] torchaudio==2.8.0+cu128
[pip3] torchvision==0.23.0+cu128
[pip3] transformers==4.56.2
[pip3] triton==3.4.0
[conda] Could not collect

============================== 
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.1rc1.dev118+g1726e93ef (git sha: 1726e93ef)
vLLM Build Flags: 


### 🐛 Describe the bug

(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py", line 202, in finalize
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]     fused_expert_output = get_dp_group().reduce_scatterv(
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 423, in reduce_scatterv
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]     return self.device_communicator.reduce_scatterv(input_, dim, sizes)
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 207, in reduce_scatterv
(Worker_DP1_EP1 pid=746595) ERROR 10-01 00:16:25 [multiproc_executor.py:671]     assert input_tensor.shape[0] == sum(sizes)
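
For readers unfamiliar with the failing check: the assertion in the traceback requires the input's first dimension to equal the sum of the per-rank output sizes. Below is a minimal single-process sketch of that precondition, written with plain Python lists. It is not vLLM's implementation; the function `reduce_scatterv_reference` and its argument names are hypothetical and only model the shape contract that `assert input_tensor.shape[0] == sum(sizes)` appears to enforce.

```python
from typing import List


def reduce_scatterv_reference(
    inputs_per_rank: List[List[float]], sizes: List[int]
) -> List[List[float]]:
    """Hypothetical single-process model of a variable-size reduce-scatter.

    inputs_per_rank[r] is rank r's full input along dim 0.
    sizes[r] is the number of rows rank r keeps after the reduction.
    """
    total = sum(sizes)
    for rank, rows in enumerate(inputs_per_rank):
        # Same precondition as the assertion shown in the traceback.
        if len(rows) != total:
            raise AssertionError(
                f"rank {rank}: input has {len(rows)} rows, "
                f"expected sum(sizes) = {total}"
            )

    # Reduce: sum the rows element-wise across ranks.
    reduced = [sum(column) for column in zip(*inputs_per_rank)]

    # Scatter: rank r receives the next sizes[r] rows.
    outputs, start = [], 0
    for n in sizes:
        outputs.append(reduced[start:start + n])
        start += n
    return outputs


# Two ranks, three rows in total, split as 1 + 2.
print(reduce_scatterv_reference([[1, 2, 3], [10, 20, 30]], [1, 2]))
```

In this model, a rank whose input has fewer or more rows than `sum(sizes)` trips the same kind of assertion, which is consistent with a mismatch between the token counts used to build `sizes` and the actual shape of `fused_expert_output` on at least one DP rank. That reading is an assumption drawn from the traceback alone.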

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
