[ci][amd] fix EPLB execution test #28742
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
tjtanaa
left a comment
LGTM
Any particular reason for the switch to torch.multiprocessing? I was looking into this test as well and found that using the "dill" pickling backend (which is what is used by
As for why this test succeeded on NVIDIA but not AMD, I think this might be a clue:
For some reason, I think the vLLM imports trigger CUDA initialization on ROCm but not NVIDIA. Not sure why yet, maybe this is expected.
Thanks for the fix!
Signed-off-by: Bradley Davis <bradleyhd@meta.com> Signed-off-by: LuminolT <lumischen01@gmail.com>
Signed-off-by: Bradley Davis <bradleyhd@meta.com> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Purpose
Fix the failing EPLB execution test on AMD CI. The test uses multiprocessing, which defaults to the "fork" start method, but tries to set the CUDA device in each child process. NVIDIA seems to be forgiving of this, but the ROCm runtime is stricter, so the "spawn" start method should be used instead. To accomplish this, I've removed most local functions (so that multiprocessing can pickle them) and set the start method to "spawn".
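The pickling constraint behind this change can be illustrated with a minimal sketch. It uses the standard-library multiprocessing module rather than torch.multiprocessing, and none of it is the code from this PR: the names worker, make_local_worker, is_picklable, and launch are hypothetical. With "spawn", the target function is pickled and sent to a fresh interpreter, so it has to be importable at module level. A function defined inside another function cannot be pickled by the default pickler.

```python
import multiprocessing as mp
import pickle


def worker(rank: int, world_size: int) -> None:
    # Module-level function: pickled by reference, so "spawn" can hand it to
    # a fresh child process. A real GPU test would select its device here;
    # that step is omitted so the sketch runs without a GPU.
    print(f"worker {rank}/{world_size} started")


def make_local_worker():
    # Returns a nested function, the pattern that breaks under "spawn".
    def local_worker(rank: int, world_size: int) -> None:
        print(f"local worker {rank}/{world_size} started")

    return local_worker


def is_picklable(obj) -> bool:
    # True if the standard pickler can serialize obj.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def launch(world_size: int) -> list:
    # Start one process per rank using an explicit "spawn" context, which
    # avoids changing the global start method for the whole interpreter.
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=worker, args=(rank, world_size))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    print("module-level picklable:", is_picklable(worker))
    print("nested picklable:", is_picklable(make_local_worker()))
    print("exit codes:", launch(2))
```

The `if __name__ == "__main__":` guard matters with "spawn", because each child re-imports the main module and would otherwise try to launch processes of its own.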
Test Plan
AMD CI.
test_eplb_execute passes (https://buildkite.com/vllm/amd-ci/builds/1064/steps/canvas?jid=019a84a1-e8cb-444f-bd2b-f7f239a02978).
Please note that the overall test still fails, in a second command against distributed/test_eplb_spec_decode.py. This is a more complex fix and can be addressed in a follow-up PR.
Test Result