
Conversation

@dolpm (Contributor) commented Sep 18, 2025

Adds support for inductor artifacts in VllmCompiledFunction

Purpose

We want to ensure that all pre-compiled artifacts (e.g., dynamo + inductor) live under one roof (i.e., VllmSerializableFunction).

The general implementation is as follows.

  1. In the compile decorator, add support to serialize/save the VllmCompiledFunction only after all user-specified shapes have been compiled. This is done after the warmup phase, when we encounter a cache miss.
  2. Add a to_bytes method to cuda_piecewise_backend which returns a mapping from each shape to its compiled bytes (using the function mentioned above).
  3. When we serialize the VllmCompiledFunction, we iterate through all the compiled subgraphs and add them to a cache that can be serialized/deserialized and dedupes the byte blobs. During deserialization, we bulk-load the blobs to get all of the runnables, then re-map them to the submodule/shape pairs under which they were originally serialized.
  4. When we load from the cache, we reconstruct the split GraphModule using the cached artifacts and pick the runnables from this cache directly. There is no need to warm up to re-map, since we know the shape -> runnable mapping at construction time, which buys us some startup performance. See the sketch after this list.
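
A minimal sketch of the dedupe cache from steps 2-4, assuming a content-addressed blob store (everything here except the to_bytes name is hypothetical; the actual vLLM data structures may differ):

```python
import hashlib
from typing import Callable


class BlobCache:
    """Content-addressed store that dedupes compiled byte blobs.

    Each (submodule, shape) pair maps to the digest of its blob, so an
    artifact shared by several shapes is stored exactly once.
    """

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}            # digest -> blob
        self.index: dict[tuple[str, int], str] = {}  # (submodule, shape) -> digest

    def add(self, submodule: str, shape: int, blob: bytes) -> None:
        digest = hashlib.sha256(blob).hexdigest()
        self.blobs.setdefault(digest, blob)  # dedupe identical blobs
        self.index[(submodule, shape)] = digest


def serialize_subgraphs(backends: dict) -> BlobCache:
    # Steps 2-3: gather every subgraph's shape -> bytes mapping
    # (backend.to_bytes()) into one dedupe cache for serialization.
    cache = BlobCache()
    for submodule, backend in backends.items():
        for shape, blob in backend.to_bytes().items():
            cache.add(submodule, shape, blob)
    return cache


def load_runnables(cache: BlobCache, load_blob: Callable[[bytes], object]) -> dict:
    # Step 4: bulk-load each unique blob once, then re-map the runnables
    # to the (submodule, shape) pairs recorded at serialization time, so
    # no warmup pass is needed to rebuild the mapping.
    loaded = {digest: load_blob(blob) for digest, blob in cache.blobs.items()}
    return {key: loaded[digest] for key, digest in cache.index.items()}
```

Under this sketch, serializing the VllmCompiledFunction only needs to persist cache.blobs and cache.index, and the load path calls load_runnables once instead of warming up each shape.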

Test Result

python benchmarks/compile/benchmark_inductor_compiled_artifacts.py --max-tokens 16 --include-baseline

| Model | OLD: Baseline AOT Cache | NEW: Inductor Cache Hit | Speedup | Time Saved | Miss Penalty vs Baseline |
| --- | --- | --- | --- | --- | --- |
| Qwen/Qwen3-32B | 51.91s | 41.57s | 1.25x | -10.34s | -9.3% |
| deepseek-ai/DeepSeek-R1-0528 | 2m 28.94s | 2m 24.52s | 1.03x | -4.41s | +8.9% |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 43.29s | 37.92s | 1.14x | -5.37s | +28.2% |
| meta-llama/Llama-3.3-70B-Instruct | 1m 13.61s | 1m 0.61s | 1.21x | -13.00s | +4.7% |
| mistralai/Mistral-Large-Instruct-2411 | 1m 49.02s | 1m 21.75s | 1.33x | -27.27s | +16.4% |
| nvidia/Llama-3.3-70B-Instruct-FP8 | 1m 22.61s | 1m 23.77s | 0.99x | +1.17s | +4.1% |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @dolpm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 18, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 05baf79 to 96f5d79 Compare September 23, 2025 20:55
@mergify mergify bot added rocm Related to AMD ROCm and removed needs-rebase labels Sep 23, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch 2 times, most recently from 0ce2620 to 38b6517 Compare September 25, 2025 18:31
mergify bot repeated the merge-conflict notice on Sep 25, 2025.

@mergify mergify bot added the needs-rebase label Sep 25, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch 2 times, most recently from db326a9 to 03fa8b3 Compare September 25, 2025 21:59
@mergify mergify bot removed the needs-rebase label Sep 25, 2025
@zhxchen17 (Contributor) left a comment

@dolpm Sorry for the delayed review. Just some initial thoughts.

zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Oct 22, 2025
Summary:
Fixing issue vllm-project#27348

For dynamo caching, it's possible that compilation succeeds but the
serialization step fails. In that case, the serialization failure
shouldn't block the user from getting correct compilation results.

Therefore we add handling of the serialization error and only emit a
warning when model saving fails. When saving fails, the vLLM model
runner should be able to just fall back to the old path, and in the
next process it will fail to load the dynamo cache but still fall
back to retracing with dynamo + loading the inductor cache, which is
the same behavior as AOT compile turned off.

This is mostly a short-term fix; in the long term we should resolve
the serialization bugs by eliminating pickling of graph modules.
Once vllm-project#25205 is merged, we should be able to resolve the
issue at a lower level.
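
A minimal sketch of the described best-effort save (the names here are hypothetical; the actual vLLM handling may differ):

```python
import logging

logger = logging.getLogger(__name__)


def compile_and_maybe_save(fn, cache_path):
    # Compilation must succeed; serialization is best-effort.
    compiled = fn.compile()  # hypothetical compile entry point
    try:
        compiled.save(cache_path)  # hypothetical serialization step
    except Exception as exc:
        # Don't block the user on a failed save: warn and continue. The
        # next process will miss the dynamo cache and fall back to
        # retracing + loading the inductor cache, as if AOT compile
        # were off.
        logger.warning("Serialization failed, skipping model save: %s", exc)
    return compiled
```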

Test Plan:

pytest tests/lora/test_quant_model.py


Signed-off-by: zhxchen17 <zhxchen17@fb.com>
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 03fa8b3 to 579ec82 Compare October 24, 2025 18:47
@mergify mergify bot added the performance Performance-related issues label Oct 24, 2025
mergify bot repeated the merge-conflict notice on Oct 24, 2025.

@mergify mergify bot added the needs-rebase label Oct 24, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 579ec82 to 869b3fe Compare October 24, 2025 19:26
@mergify mergify bot removed the needs-rebase label Oct 24, 2025
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Oct 27, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 869b3fe to 7465ae3 Compare October 27, 2025 17:56
@dolpm dolpm force-pushed the remote/serialize-inductor branch 5 times, most recently from cdb89d9 to 7d75e72 Compare October 27, 2025 19:41
@dolpm dolpm marked this pull request as ready for review October 27, 2025 19:43
@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@dolpm dolpm force-pushed the remote/serialize-inductor branch 2 times, most recently from 9c2eb6d to ed28803 Compare October 29, 2025 17:58
mergify bot repeated the merge-conflict notice on Oct 29, 2025.

@mergify mergify bot added the needs-rebase label Oct 29, 2025
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com>
@dolpm dolpm force-pushed the remote/serialize-inductor branch from ed28803 to fca965f Compare November 3, 2025 21:06
@dolpm (Contributor, Author) commented Nov 10, 2025

@zou3519 bump


Labels

ci/build, needs-rebase, performance, rocm, v1
