@bradleyhd (Contributor) commented Nov 14, 2025

Purpose

This PR fixes the failing EPLB execution test in AMD CI. The test uses multiprocessing (defaulting to the fork start method) but sets CUDA devices in each child process. The NVIDIA runtime seems forgiving of this, but the ROCm runtime is stricter, so the "spawn" start method should be used. To accomplish this, I've moved most local functions to module level (so multiprocessing can pickle them) and set the start method to "spawn".
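
For reference, here is a minimal sketch of the pattern described above (hypothetical names, not the actual test code). With "spawn", the worker has to live at module level so it can be pickled by reference:

import torch
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int) -> None:
    # Each spawned process starts a fresh interpreter, so CUDA is not
    # yet initialized here and set_device() is safe on NVIDIA and ROCm.
    torch.cuda.set_device(rank)

if __name__ == "__main__":
    # "fork" would hand the children an already-initialized CUDA context,
    # which the ROCm runtime rejects; mp.spawn() defaults to "spawn".
    mp.spawn(_worker, args=(2,), nprocs=2)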

Test Plan

AMD CI: test_eplb_execute passes (https://buildkite.com/vllm/amd-ci/builds/1064/steps/canvas?jid=019a84a1-e8cb-444f-bd2b-f7f239a02978).
Note that the overall test step still fails on a second command, against distributed/test_eplb_spec_decode.py. That is a more complex fix and can be addressed in a follow-up PR.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

Signed-off-by: Bradley Davis <bradleyhd@meta.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Nov 14, 2025
@bradleyhd bradleyhd marked this pull request as ready for review November 19, 2025 19:41
@bradleyhd (Contributor, Author) commented:

cc @zhewenl @tjtanaa

@tjtanaa (Collaborator) left a comment:

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
@tjtanaa tjtanaa merged commit 1e1c067 into vllm-project:main Nov 20, 2025
23 checks passed
@micah-wil (Contributor) commented:

Any particular reason for the switch to torch.multiprocessing? I was looking into this test as well and found that the "dill" pickling backend (which is what import multiprocess as mp uses instead of import multiprocessing as mp) successfully pickles the nested functions without having to refactor the file (you still need to call mp.set_start_method('spawn'), though). Not sure if there is some advantage to using torch.multiprocessing instead; I am mostly asking out of curiosity.
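
A sketch of that alternative, assuming the third-party multiprocess package is installed (names here are illustrative, not from the test file):

import multiprocess as mp  # third-party package backed by dill

def run() -> None:
    def nested_worker(rank):
        # Nested function: the stdlib pickle used by "spawn" cannot
        # serialize this, but dill (used by multiprocess) can.
        print(f"hello from rank {rank}")

    procs = [mp.Process(target=nested_worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    run()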

As for why this test succeeded on NVIDIA but not AMD, I think this might be a clue:

On ROCm (MI300X):

Python 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_initialized()
False
>>> from vllm.distributed.eplb.rebalance_execute import rearrange_expert_weights_inplace
>>> torch.cuda.is_initialized()
True
>>>

On NVIDIA H100:

Python 3.12.12 (main, Oct 14 2025, 21:25:31) [Clang 20.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_initialized()
False
>>> from vllm.distributed.eplb.rebalance_execute import rearrange_expert_weights_inplace
>>> torch.cuda.is_initialized()
False
>>>

For some reason, the vLLM import triggers CUDA initialization on ROCm but not on NVIDIA. Not sure why yet; maybe this is expected.

Thanks for the fix!

LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
Signed-off-by: LuminolT <lumischen01@gmail.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>