
Conversation

@dolpm (Contributor) commented Sep 18, 2025

Adds support for inductor artifacts in VllmCompiledFunction

Purpose

We want to ensure that all pre-compiled artifacts (e.g., dynamo + inductor) live under one roof (i.e., VllmSerializableFunction).

The general implementation is as follows.

  1. In the compile decorator, add support to serialize/save the VllmCompiledFunction only after all user-specified shapes have been compiled. This is done after the warmup phase, when we encounter a cache miss.
  2. Add a to_bytes method to cuda_piecewise_backend which returns a mapping from each shape to its compiled bytes (using the function mentioned above).
  3. When we serialize the VllmCompiledFunction, we iterate through all the compiled subgraphs and add them to a cache that can be serialized/deserialized and dedupes the byte blobs. During deserialization, we bulk-load the blobs to get all of the runnables, then re-map them to the submodule/shape pairs under which they were originally serialized.
  4. When we load from the cache, we reconstruct the split GraphModule using the cached artifacts and pick the runnables from this cache directly. There is no need to warm up to re-map, since we know the shape -> runnable mapping at construction time, which buys us some startup performance. See the sketch after this list.
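
A minimal sketch of the dedupe cache from steps 2-4, assuming a content-addressed blob store (everything here except the to_bytes name is hypothetical; the actual vLLM data structures may differ):

```python
import hashlib
from typing import Callable


class BlobCache:
    """Content-addressed store that dedupes compiled byte blobs.

    Each (submodule, shape) pair maps to the digest of its blob, so an
    artifact shared by several shapes is stored exactly once.
    """

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}            # digest -> blob
        self.index: dict[tuple[str, int], str] = {}  # (submodule, shape) -> digest

    def add(self, submodule: str, shape: int, blob: bytes) -> None:
        digest = hashlib.sha256(blob).hexdigest()
        self.blobs.setdefault(digest, blob)  # dedupe identical blobs
        self.index[(submodule, shape)] = digest


def serialize_subgraphs(backends: dict) -> BlobCache:
    # Steps 2-3: gather every subgraph's shape -> bytes mapping
    # (backend.to_bytes()) into one dedupe cache for serialization.
    cache = BlobCache()
    for submodule, backend in backends.items():
        for shape, blob in backend.to_bytes().items():
            cache.add(submodule, shape, blob)
    return cache


def load_runnables(cache: BlobCache, load_blob: Callable[[bytes], object]) -> dict:
    # Step 4: bulk-load each unique blob once, then re-map the runnables
    # to the (submodule, shape) pairs recorded at serialization time, so
    # no warmup pass is needed to rebuild the mapping.
    loaded = {digest: load_blob(blob) for digest, blob in cache.blobs.items()}
    return {key: loaded[digest] for key, digest in cache.index.items()}
```

Under this sketch, serializing the VllmCompiledFunction only needs to persist cache.blobs and cache.index, and the load path calls load_runnables once instead of warming up each shape.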

Test Result

python benchmarks/compile/benchmark_inductor_compiled_artifacts.py --max-tokens 16 --include-baseline

| Model | OLD: Baseline AOT Cache | NEW: Inductor Cache Hit | Speedup | Time Saved | Miss Penalty vs Baseline |
| --- | --- | --- | --- | --- | --- |
| Qwen/Qwen3-32B | 51.91s | 41.57s | 1.25x | -10.34s | -9.3% |
| deepseek-ai/DeepSeek-R1-0528 | 2m 28.94s | 2m 24.52s | 1.03x | -4.41s | +8.9% |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 43.29s | 37.92s | 1.14x | -5.37s | +28.2% |
| meta-llama/Llama-3.3-70B-Instruct | 1m 13.61s | 1m 0.61s | 1.21x | -13.00s | +4.7% |
| mistralai/Mistral-Large-Instruct-2411 | 1m 49.02s | 1m 21.75s | 1.33x | -27.27s | +16.4% |
| nvidia/Llama-3.3-70B-Instruct-FP8 | 1m 22.61s | 1m 23.77s | 0.99x | +1.17s | +4.1% |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @dolpm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 18, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 05baf79 to 96f5d79 Compare September 23, 2025 20:55
@mergify mergify bot added rocm Related to AMD ROCm and removed needs-rebase labels Sep 23, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch 2 times, most recently from 0ce2620 to 38b6517 Compare September 25, 2025 18:31
mergify bot repeated the merge-conflict notice on Sep 25, 2025.

@mergify mergify bot added the needs-rebase label Sep 25, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch 2 times, most recently from db326a9 to 03fa8b3 Compare September 25, 2025 21:59
@mergify mergify bot removed the needs-rebase label Sep 25, 2025
@zhxchen17 (Contributor) left a comment

@dolpm Sorry for the delayed review. Just some initial thoughts.

zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Oct 22, 2025
Summary:
Fixing issue vllm-project#27348

For dynamo caching, it's possible that compilation succeeds but the
serialization step fails. In that case, the serialization failure
shouldn't block the user from getting correct compilation results.

Therefore we add handling of the serialization error and only emit a
warning when model saving fails. When saving fails, the vLLM model
runner should be able to just fall back to the old path, and in the
next process it will fail to load the dynamo cache but still fall
back to retracing with dynamo + loading the inductor cache, which is
the same behavior as AOT compile turned off.

This is mostly a short-term fix; in the long term we should resolve
the serialization bugs by eliminating pickling of graph modules.
Once vllm-project#25205 is merged, we should be able to resolve the
issue at a lower level.
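
A minimal sketch of the described best-effort save (the names here are hypothetical; the actual vLLM handling may differ):

```python
import logging

logger = logging.getLogger(__name__)


def compile_and_maybe_save(fn, cache_path):
    # Compilation must succeed; serialization is best-effort.
    compiled = fn.compile()  # hypothetical compile entry point
    try:
        compiled.save(cache_path)  # hypothetical serialization step
    except Exception as exc:
        # Don't block the user on a failed save: warn and continue. The
        # next process will miss the dynamo cache and fall back to
        # retracing + loading the inductor cache, as if AOT compile
        # were off.
        logger.warning("Serialization failed, skipping model save: %s", exc)
    return compiled
```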

Test Plan:

pytest tests/lora/test_quant_model.py


Signed-off-by: zhxchen17 <zhxchen17@fb.com>
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 03fa8b3 to 579ec82 Compare October 24, 2025 18:47
@mergify mergify bot added the performance Performance-related issues label Oct 24, 2025
mergify bot repeated the merge-conflict notice on Oct 24, 2025.

@mergify mergify bot added the needs-rebase label Oct 24, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 579ec82 to 869b3fe Compare October 24, 2025 19:26
@mergify mergify bot removed the needs-rebase label Oct 24, 2025
zhxchen17 added a commit to zhxchen17/vllm that referenced this pull request Oct 27, 2025
@dolpm dolpm force-pushed the remote/serialize-inductor branch from 869b3fe to 7465ae3 Compare October 27, 2025 17:56
@dolpm dolpm force-pushed the remote/serialize-inductor branch 5 times, most recently from cdb89d9 to 7d75e72 Compare October 27, 2025 19:41
@dolpm dolpm marked this pull request as ready for review October 27, 2025 19:43
@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@dolpm dolpm force-pushed the remote/serialize-inductor branch 2 times, most recently from 9c2eb6d to ed28803 Compare October 29, 2025 17:58
mergify bot repeated the merge-conflict notice on Oct 29, 2025.

@mergify mergify bot added the needs-rebase label Oct 29, 2025
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com>
@dolpm dolpm force-pushed the remote/serialize-inductor branch from ed28803 to fca965f Compare November 3, 2025 21:06
@dolpm (Contributor, Author) commented Nov 10, 2025

@zou3519 bump


Labels

ci/build, needs-rebase, performance, rocm, v1
