[AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction #25205
Conversation
Force-pushed from 05baf79 to 96f5d79
Force-pushed from 0ce2620 to 38b6517
Force-pushed from db326a9 to 03fa8b3
zhxchen17 left a comment:
@dolpm Sorry for the delayed review. Just some initial thoughts.
Summary: Fixes issue vllm-project#27348. With dynamo caching, it is possible for compilation to succeed while the serialization step fails. In that case, the serialization failure shouldn't block the user from getting correct compilation results, so we handle the serialization error and only emit a warning when saving the model fails. When saving fails, the vLLM model runner can simply fall back to the old path; in the next process it will fail to load the dynamo cache but still fall back to retracing with dynamo plus loading the inductor cache, which is the same behavior as having AOT compile turned off. This is mostly a short-term fix; in the long term we should resolve the serialization bugs by eliminating the pickling of graph modules, i.e., once vllm-project#25205 is merged, we should be able to resolve the issue at a lower level.
Test Plan: pytest tests/lora/test_quant_model.py
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
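A minimal sketch of this warn-and-fall-back handling (the helper name and logging setup below are hypothetical, not the actual vLLM code):

```python
import logging
import pickle

logger = logging.getLogger(__name__)


def try_save_compiled_artifact(artifact, path: str) -> bool:
    # Hypothetical helper: compilation has already succeeded by this point,
    # so a serialization failure should only cost us the cache, not the run.
    try:
        with open(path, "wb") as f:
            pickle.dump(artifact, f)
        return True
    except Exception as exc:
        # Warn instead of raising; the next process will fail to load the
        # dynamo cache and fall back to retracing + the inductor cache.
        logger.warning(
            "Failed to serialize compiled artifact to %s; "
            "falling back to recompilation on the next run: %s",
            path, exc,
        )
        return False
```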
Force-pushed from 03fa8b3 to 579ec82
Force-pushed from 579ec82 to 869b3fe
Force-pushed from 869b3fe to 7465ae3
Force-pushed from cdb89d9 to 7d75e72
Force-pushed from 9c2eb6d to ed28803
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com>
Force-pushed from ed28803 to fca965f
@zou3519 bump
Adds support for inductor artifacts in VllmCompiledFunction
Purpose
We want to ensure that all pre-compiled artifacts (e.g., dynamo + inductor) live under one roof (i.e., VllmSerializableFunction). The general approach is sketched below.
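As an illustration only (the class and method names below are hypothetical, not the actual vLLM API), such a unified container might bundle the dynamo and inductor artifacts roughly like this:

```python
import pickle
from dataclasses import dataclass


@dataclass
class SerializableCompiledFunction:
    # Hypothetical sketch: both compilation stages produce opaque byte
    # blobs that are saved and loaded together, so a cache hit restores
    # everything needed to skip recompilation.
    dynamo_artifacts: bytes
    inductor_artifacts: bytes

    def save(self, path: str) -> None:
        with open(path, "wb") as f:
            pickle.dump((self.dynamo_artifacts, self.inductor_artifacts), f)

    @classmethod
    def load(cls, path: str) -> "SerializableCompiledFunction":
        with open(path, "rb") as f:
            dynamo_blob, inductor_blob = pickle.load(f)
        return cls(dynamo_blob, inductor_blob)
```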
Test Result
python benchmarks/compile/benchmark_inductor_compiled_artifacts.py --max-tokens 16 --include-baseline