<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="docs/source/logos/UCM-dark.png">
    <img alt="UCM" src="docs/source/logos/UCM-light.png" width=50%>
  </picture>
</p>

<p align="center">
| <a href="docs/source/index.md"><b>Documentation</b></a> | <a href="https://modelengine-ai.net/#/ucm"><b>Website</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/issues/78"><b>Roadmap</b></a> | <a href="https://github.com/ModelEngine-Group/unified-cache-management/blob/main/README_zh.md"><b>中文</b></a> |
</p>

---

## Overview

The core principle of Unified Cache Manager (UCM) is to persist the LLM KVCache and replace redundant computation
through multiple retrieval mechanisms. UCM not only supports prefix caching but also offers a variety of training-free
sparse attention retrieval methods, delivering higher performance when handling extremely long sequence inference
tasks. Additionally, UCM provides a PD disaggregation solution based on a storage-compute separation architecture,
which enables more straightforward and flexible management of heterogeneous computing resources. When integrated with
vLLM, UCM achieves a 3-10x reduction in inference latency across various scenarios, including multi-turn dialogue and
long-context reasoning tasks.

### Motivation

As model sizes grow, the KV cache becomes larger and sparser, especially for long-sequence requests. To reduce GPU
memory usage, offloading the full KV cache to external storage and keeping only partial or compressed KV in GPU memory
has become a popular direction. This also reduces GPU computation and allows longer sequences and larger batch sizes
during decoding.

There are many possible choices for a sparse KV cache. Recent papers point out that no single method fits all
scenarios and all models, so it is better to build a common framework into which different sparse algorithms can be
plugged, much like the KV connector design used for prefix caching (PC).

*(UCM architecture diagram)*

All gray boxes in the diagram represent existing classes in vLLM version 0.9.2, while the green boxes indicate
components newly added by UCM. The light green boxes show potential future subclass extensions of this framework.

UcmSparseBase is the base class for the different sparse algorithms. Just like the KV connector design, it hooks into
a few places in the scheduler and layer.py so that sparse algorithms can additionally load, dump, and compute sparse
KV blocks.

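To make the hook pattern concrete, here is a minimal sketch of what a sparse-algorithm plugin built on this base class
could look like. Every class, method, and parameter name below is an illustrative assumption rather than the real UCM
API; see the source tree for the actual interfaces.

```python
# Illustrative sketch only -- names are assumptions, NOT the actual UCM API.
from abc import ABC, abstractmethod


class UcmSparseBase(ABC):
    """Simplified stand-in for the base class hooked into the scheduler and layer.py."""

    @abstractmethod
    def select_blocks(self, request_id: str, num_blocks: int) -> list[int]:
        """Scheduler hook: choose which KV block indices stay on the GPU."""

    @abstractmethod
    def load(self, layer_name: str, block_ids: list[int]) -> None:
        """Layer hook: fetch the selected blocks before attention runs."""

    @abstractmethod
    def dump(self, layer_name: str, block_ids: list[int]) -> None:
        """Layer hook: persist newly written blocks after attention runs."""


class SlidingWindowSparse(UcmSparseBase):
    """Toy algorithm: keep only the most recent `window` blocks resident."""

    def __init__(self, window: int = 64):
        self.window = window

    def select_blocks(self, request_id: str, num_blocks: int) -> list[int]:
        start = max(0, num_blocks - self.window)
        return list(range(start, num_blocks))

    def load(self, layer_name: str, block_ids: list[int]) -> None:
        pass  # a real plugin would read these blocks from the KV store

    def dump(self, layer_name: str, block_ids: list[int]) -> None:
        pass  # a real plugin would write these blocks to the KV store
```
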
SparseKVManager allows users to define custom KV block allocations for different algorithms. To keep all
implementations unified under the UcmSparseBase framework, the system calls the UcmSparseBase interface, while the
actual allocation logic lives in the sparse-algorithm subclasses.

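As a sketch of that dispatch pattern (reusing the hypothetical classes from the previous example, so the names are
again assumptions): the manager holds a reference typed against the base class and delegates to whichever concrete
algorithm was configured.

```python
# Illustrative only: allocation is delegated through the base-class interface.
class SparseKVManager:
    def __init__(self, algorithm: UcmSparseBase):
        self.algorithm = algorithm  # any UcmSparseBase subclass

    def blocks_to_allocate(self, request_id: str, num_blocks: int) -> list[int]:
        # The manager never inspects the concrete type; the policy
        # (sliding window, top-k, ...) lives entirely in the subclass.
        return self.algorithm.select_blocks(request_id, num_blocks)


manager = SparseKVManager(SlidingWindowSparse(window=16))
print(manager.blocks_to_allocate("req-0", num_blocks=100))  # -> [84, ..., 99]
```
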
KVStoreBase helps decouple sparse algorithms from external storage. It defines the methods for communicating with
external storage, enabling any sparse algorithm to work seamlessly with any external storage system. The core concept
here is identifying blocks through IDs and offsets. This approach is not only suitable for sparse scenarios but also
naturally accommodates prefix caching. The KVStoreConnector links it with the current KVConnectorBase_V1 to provide
PC (Prefix Caching) functionality. For example, NFSStore serves as a reference implementation that can store KVCache
either in a local filesystem for single-machine scenarios or through NFS mount points in multi-server environments.

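The following toy store illustrates the ID-plus-offset addressing idea. It is a simplified stand-in, not the real
KVStoreBase or NFSStore interface, and every name in it is hypothetical.

```python
# Toy illustration of ID + offset block addressing; not the real KVStoreBase API.
import os


class FileKVStore:
    """Stores each KV block as one file; offsets address tensors inside it."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_id: str) -> str:
        return os.path.join(self.root, block_id)

    def dump(self, block_id: str, offset: int, data: bytes) -> None:
        path = self._path(block_id)
        mode = "r+b" if os.path.exists(path) else "w+b"  # update in place or create
        with open(path, mode) as f:
            f.seek(offset)
            f.write(data)

    def load(self, block_id: str, offset: int, length: int) -> bytes:
        with open(self._path(block_id), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def lookup(self, block_id: str) -> bool:
        """Prefix caching reduces to asking whether a block already exists."""
        return os.path.exists(self._path(block_id))
```

Because a block is addressed purely by its ID (for example, a hash of the token prefix) plus an offset, the same store
can back both prefix caching and sparse-attention retrieval.
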
---

## Supported Features

- Prefix Cache
- Cache Blend
- Model Window Extrapolation
- Prefill Offload
- Sparse Attention
- Sparse Attention Offload
- Heterogeneous PD Disaggregation

---

## Quick Start

Please refer to the [Quick Start](./docs/source/getting-started/quick_start.md) guide.

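For a flavor of what an integration looks like, here is a hedged sketch of offline inference in vLLM with a KV
connector configured. The connector name and the extra-config key/value are placeholders, not guaranteed UCM values;
take the exact configuration from the Quick Start guide.

```python
# Hedged sketch: the connector name and extra-config keys are placeholders.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

ktc = KVTransferConfig(
    kv_connector="UnifiedCacheConnectorV1",  # placeholder name
    kv_role="kv_both",  # connector both saves and loads KV
    kv_connector_extra_config={"storage_backend": "nfs"},  # placeholder key/value
)

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", kv_transfer_config=ktc)
outputs = llm.generate(["Hello, UCM!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
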
---

## Branches

| **Branch** | Status | vLLM version |
|-----------:|-----------:|-------------:|
| main | Maintained | v0.9.2 |
| develop | Maintained | v0.9.2 |

---

## Contact Us

For technical questions and feature requests, please use
GitHub [Issues](https://github.com/ModelEngine-Group/unified-cache-management/issues).

## License

UCM is licensed under the MIT License with additional conditions. Please read the [LICENSE](./LICENSE) file for details.