## **Announcing vLLM-Omni: Easy, Fast, and Cheap Omni-Modality Model Serving**

*The vLLM-Omni Team*

We are excited to announce the official release of **vLLM-Omni**, a major extension of the vLLM ecosystem designed to support the next generation of AI: omni-modality models.

Since its inception, vLLM has focused on high-throughput, memory-efficient serving for Large Language Models (LLMs). However, the landscape of generative AI is shifting rapidly. Models are no longer just about text-in, text-out. Today's state-of-the-art models reason across text, images, audio, and video, and they generate heterogeneous outputs using diverse architectures.

**vLLM-Omni** answers this call, extending vLLM's legendary performance to the world of multi-modal and non-autoregressive inference.

<p align="center">
<img src="/assets/figures/vllm-omni-logo-text-dark.png" alt="vLLM Omni Logo" width="60%">
</p>

## **Why vLLM-Omni?**

Traditional serving engines were optimized for text-based Autoregressive (AR) tasks. As models evolve into "omni" agents—capable of seeing, hearing, and speaking—the serving infrastructure must evolve with them.

vLLM-Omni addresses three critical shifts in model architecture:

1. **True Omni-Modality:** Processing and generating Text, Image, Video, and Audio seamlessly.
2. **Beyond Autoregression:** Extending vLLM's efficient memory management to **Diffusion Transformers (DiT)** and other parallel generation models.
3. **Heterogeneous Pipelines:** Managing complex workflows where a single request might trigger a visual encoder, an AR reasoning step, and a diffusion-based video generation step.

## **Inside the Architecture**

vLLM-Omni is not just a wrapper; it is a re-imagining of how vLLM handles data flow. It introduces a fully disaggregated pipeline that allows for dynamic resource allocation across different stages of generation.

<p align="center">
<img src="/assets/figures/omni-modality-model-architecture.png" alt="Omni-modality model architecture" width="80%">
</p>

As shown above, the architecture unifies distinct phases (a simplified sketch of the flow follows the list):

* **Modality Encoders:** Efficiently processing inputs (ViT, T5, etc.).
* **LLM Core:** Leveraging vLLM's PagedAttention for the autoregressive reasoning stage.
* **Modality Generators:** High-performance serving for DiT and other decoding heads to produce rich media outputs.

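To make the data flow concrete, here is a deliberately simplified Python sketch of how a single request moves through these three stages. The class and function names are purely illustrative and are not the vLLM-Omni API; they only mirror the encoder → LLM core → generator split described above.

```python
# Conceptual sketch only: the names below are illustrative, not the vLLM-Omni API.
from dataclasses import dataclass, field


@dataclass
class OmniRequest:
    text: str
    images: list = field(default_factory=list)  # raw image inputs
    audio: list = field(default_factory=list)   # raw audio clips


def encode_modalities(req: OmniRequest) -> dict:
    """Modality encoders (e.g. a ViT for images) turn raw inputs into
    embeddings that the LLM core can consume."""
    return {"text": req.text, "image_emb": req.images, "audio_emb": req.audio}


def llm_core(encoded: dict) -> str:
    """Autoregressive reasoning stage; in vLLM this is where PagedAttention
    manages the KV cache during token-by-token decoding."""
    return f"<generation plan for: {encoded['text']}>"


def modality_generator(plan: str) -> bytes:
    """Non-autoregressive generation head (e.g. a DiT) that renders the final
    image / audio / video output from the LLM core's plan."""
    return plan.encode()  # stand-in for generated media


def run_pipeline(req: OmniRequest) -> bytes:
    # In a disaggregated deployment, each stage can live on different GPUs or nodes.
    return modality_generator(llm_core(encode_modalities(req)))


print(run_pipeline(OmniRequest(text="draw a red panda surfing")))
```

Because each stage is an independent unit, the engine can place, scale, and schedule them separately, which is what enables the dynamic resource allocation mentioned above.
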
### **Key Features**

* **Simplicity:** If you know how to use vLLM, you know how to use vLLM-Omni. We maintain seamless integration with Hugging Face models and offer an OpenAI-compatible API server.

* **Flexibility:** With the OmniStage abstraction, we provide a simple and straightforward way to support various omni-modality models, including Qwen-Omni, Qwen-Image, and Stable Diffusion (SD) models.

* **Performance:** We utilize pipelined stage execution to overlap computation, ensuring that while one stage is processing, others aren't idle (see the conceptual sketch below).

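To see why this matters, the toy Python sketch below (again, not vLLM-Omni code) runs an encoder stage and a generator stage in separate threads connected by a queue: while the generator works on request N, the encoder is already processing request N + 1, so total latency approaches the slower stage rather than the sum of both.

```python
# Toy illustration of pipelined stage execution (not vLLM-Omni code).
# Two stages run concurrently; while the "generator" works on request N,
# the "encoder" is already processing request N + 1.
import queue
import threading
import time

NUM_REQUESTS = 4
handoff: queue.Queue = queue.Queue(maxsize=2)


def encoder_stage() -> None:
    for i in range(NUM_REQUESTS):
        time.sleep(0.1)               # pretend to encode images/audio
        print(f"[encoder]   finished request {i}")
        handoff.put(i)
    handoff.put(None)                 # sentinel: no more work


def generator_stage() -> None:
    while (item := handoff.get()) is not None:
        time.sleep(0.1)               # pretend to run the DiT / decode head
        print(f"[generator] finished request {item}")


start = time.perf_counter()
t = threading.Thread(target=encoder_stage)
t.start()
generator_stage()
t.join()
# Overlapped: roughly 0.5 s total instead of roughly 0.8 s back to back.
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Pipelined stage execution in vLLM-Omni aims for the same effect across the encoder, LLM core, and generator stages.
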
## **Performance**

We benchmarked vLLM-Omni against Hugging Face Transformers to demonstrate the efficiency gains in omni-modal serving.

| Metric | vLLM-Omni | HF Transformers | Improvement |
| :---- | :---- | :---- | :---- |
| **Throughput** (req/s) | **TBD** | TBD | **TBD x** |
| **Latency** (TTFT, ms) | **TBD** | TBD | **TBD x** |
| **GPU Memory** (GB) | **TBD** | TBD | **TBD %** |

*Note: Benchmarks were run on [Insert Hardware Specs] using [Insert Model Name].*

## **Future Roadmap**

vLLM-Omni is evolving rapidly. Our roadmap is focused on expanding model support and pushing the boundaries of efficient inference even further.

* **Expanded Model Support:** We plan to support a wider range of open-source omni-models and diffusion transformers as they emerge.
* **Deeper vLLM Integration:** Merging core omni-features upstream to make multi-modality a first-class citizen in the entire vLLM ecosystem.
* **Diffusion Acceleration:** Parallel inference (DP/TP/SP/USP, ...), cache acceleration (TeaCache/DBCache, ...), and compute acceleration (quantization, sparse attention, ...).
* **Full Disaggregation:** Based on the OmniStage abstraction, we expect to support full disaggregation (encoder/prefill/decode/generation) across different inference stages in order to improve throughput and reduce latency.
* **Hardware Support:** Following the hardware plugin system, we plan to expand our support for various hardware backends to ensure vLLM-Omni runs efficiently everywhere.

Contributions and collaborations from the open source community are welcome.

## **Getting Started**

Getting started with vLLM-Omni is straightforward. The initial release is built on top of vLLM v0.11.0.

### **Installation**

First, set up your environment:

```bash
# Create a virtual environment
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install the base vLLM
uv pip install vllm==0.11.0 --torch-backend=auto
```

Next, install the vLLM-Omni extension:

```bash
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
```

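Before moving on, a quick sanity check can confirm that the base engine installed correctly. This only verifies the pinned vLLM version; the extension's own top-level import name is not assumed here, so check the project docs for how to verify it directly.

```python
# Minimal post-install sanity check: the base engine should be importable
# and report the pinned version.
import vllm

print(vllm.__version__)  # expected: 0.11.0 for this release
```
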
### **Running the Qwen3-Omni model**

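Since vLLM-Omni exposes an OpenAI-compatible API server, one way to get started is to query a running server with the standard `openai` Python client. The snippet below is only a sketch: the command used to launch the server, the port, and the exact model identifier depend on your deployment, so treat the values here as placeholders.

```python
# Sketch of querying a locally running OpenAI-compatible vLLM-Omni server.
# Assumptions: the server is already started on localhost:8000 and is serving
# a Qwen3-Omni checkpoint; the model name and image URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Omni",  # replace with the model id reported by the server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/cat.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For end-to-end scripts, including media generation workflows, see the examples directory linked below.
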
Check out our [examples directory](https://github.com/vllm-project/vllm-omni/tree/main/examples) for specific scripts to launch image, audio, and video generation workflows.

## **Join the Community**

This is just the beginning for omni-modality serving. We are actively developing support for more architectures and invite the community to help shape the future of vLLM-Omni.

* **Code & Docs:** [GitHub Repository](https://github.com/vllm-project/vllm-omni) | [Documentation](https://vllm-omni.readthedocs.io/en/latest/)
* **Weekly Meeting:** Join us every Wednesday at 11:30 (UTC+8) to discuss the roadmap and features. [Join here](https://tinyurl.com/vllm-omni-meeting).

Let's build the future of omni-modal serving together!