@bradleyhd (Contributor) commented Nov 14, 2025

Purpose

This PR fixes the failing EPLB execution test in AMD CI. The test uses multiprocessing (defaulting to the fork start method) but sets CUDA devices in each child process. The NVIDIA runtime seems forgiving of this, but the ROCm runtime is stricter, so the "spawn" start method should be used. To accomplish this, I've moved most local functions to module level (so multiprocessing can pickle them) and set the start method to "spawn".
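
For reference, here is a minimal sketch of the pattern described above (hypothetical names, not the actual test code). With "spawn", the worker has to live at module level so it can be pickled by reference:

import torch
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int) -> None:
    # Each spawned process starts a fresh interpreter, so CUDA is not
    # yet initialized here and set_device() is safe on NVIDIA and ROCm.
    torch.cuda.set_device(rank)

if __name__ == "__main__":
    # "fork" would hand the children an already-initialized CUDA context,
    # which the ROCm runtime rejects; mp.spawn() defaults to "spawn".
    mp.spawn(_worker, args=(2,), nprocs=2)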

Test Plan

AMD CI: test_eplb_execute passes (https://buildkite.com/vllm/amd-ci/builds/1064/steps/canvas?jid=019a84a1-e8cb-444f-bd2b-f7f239a02978).
Note that the overall test step still fails on a second command, against distributed/test_eplb_spec_decode.py. That is a more complex fix and can be addressed in a follow-up PR.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

Signed-off-by: Bradley Davis <bradleyhd@meta.com>
@mergify mergify bot added the rocm Related to AMD ROCm label Nov 14, 2025
@bradleyhd bradleyhd marked this pull request as ready for review November 19, 2025 19:41
@bradleyhd (Contributor, Author) commented:

cc @zhewenl @tjtanaa

@tjtanaa (Collaborator) left a comment:

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 20, 2025
@tjtanaa tjtanaa merged commit 1e1c067 into vllm-project:main Nov 20, 2025
23 checks passed
@micah-wil (Contributor) commented:

Any particular reason for the switch to torch.multiprocessing? I was looking into this test as well and found that the "dill" pickling backend (which is what import multiprocess as mp uses instead of import multiprocessing as mp) successfully pickles the nested functions without having to refactor the file (you still need to call mp.set_start_method('spawn'), though). Not sure if there is some advantage to using torch.multiprocessing instead; I am mostly asking out of curiosity.
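
A sketch of that alternative, assuming the third-party multiprocess package is installed (names here are illustrative, not from the test file):

import multiprocess as mp  # third-party package backed by dill

def run() -> None:
    def nested_worker(rank):
        # Nested function: the stdlib pickle used by "spawn" cannot
        # serialize this, but dill (used by multiprocess) can.
        print(f"hello from rank {rank}")

    procs = [mp.Process(target=nested_worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    run()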

As for why this test succeeded on NVIDIA but not AMD, I think this might be a clue:

On ROCm (MI300X):

Python 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_initialized()
False
>>> from vllm.distributed.eplb.rebalance_execute import rearrange_expert_weights_inplace
>>> torch.cuda.is_initialized()
True
>>>

On NVIDIA H100:

Python 3.12.12 (main, Oct 14 2025, 21:25:31) [Clang 20.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_initialized()
False
>>> from vllm.distributed.eplb.rebalance_execute import rearrange_expert_weights_inplace
>>> torch.cuda.is_initialized()
False
>>>

For some reason, the vLLM import triggers CUDA initialization on ROCm but not on NVIDIA. Not sure why yet; maybe this is expected.

Thanks for the fix!

LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
Signed-off-by: LuminolT <lumischen01@gmail.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
Signed-off-by: Bradley Davis <bradleyhd@meta.com>