[Feature] Enable prefix caching for mtp #5302
base: develop
Thanks for your contribution!
Pull request overview
This PR enables prefix caching for Multi-Token Prediction (MTP) speculative decoding by removing a conditional check that previously disabled this feature when speculative_config was present.
Key Changes
- Removes the check that disabled enable_prefix_caching when speculative_config is not None
- Allows MTP to benefit from prefix caching, which should improve performance for requests with shared prefixes
- The MTP implementation code already supports prefix caching (as seen in fastdeploy/spec_decode/mtp.py line 180)
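The effect of the change described above can be sketched in isolation. This is a minimal illustration, not FastDeploy's actual config class: `CacheGateSketch`, its `resolve` method, the `platform_supported` flag (standing in for the `current_platform` checks), and the default role value `"mixed"` are all hypothetical names chosen for the example. Only the two remaining conditions mirror the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheGateSketch:
    """Hypothetical stand-in for the config object touched by this PR."""

    enable_prefix_caching: bool = True
    splitwise_role: str = "mixed"  # assumed default; only "decode" matters here
    speculative_config: Optional[dict] = None
    platform_supported: bool = True  # stands in for the current_platform checks

    def resolve(self) -> bool:
        # Decode-only instances still turn prefix caching off.
        if self.splitwise_role == "decode":
            self.enable_prefix_caching = False
        # Before this PR, a non-None speculative_config also forced the flag
        # to False at this point. That branch is intentionally absent here.
        if not self.platform_supported:
            self.enable_prefix_caching = False
        return self.enable_prefix_caching


# With the check removed, a speculative (MTP) config keeps prefix caching on.
mtp_gate = CacheGateSketch(speculative_config={"method": "mtp"})
assert mtp_gate.resolve() is True
```

The other two conditions are unchanged, so a decode-role instance or an unsupported platform would still end up with prefix caching disabled in this sketch.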
if self.splitwise_role == "decode":
    self.enable_prefix_caching = False
if self.speculative_config is not None:
    self.enable_prefix_caching = False
if not current_platform.is_cuda() and not current_platform.is_xpu() and not current_platform.is_intel_hpu():
    self.enable_prefix_caching = False
Copilot
AI
Nov 30, 2025
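This change enables prefix caching for MTP, but it lacks corresponding test coverage. Although tests/e2e/test_ernie_21b_mtp.py already tests MTP functionality, that test does not explicitly verify that prefix caching works correctly.
Suggested tests to add:
- Test the correctness of MTP with prefix caching enabled (for example, the behavior when multiple requests share the same prefix)
- Test the impact of prefix caching on MTP performance
- Verify that the KV cache is shared and managed correctly
Alternatively, if the existing e2e tests already cover this scenario implicitly, please explain in the PR description how the tests validate this change.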
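One possible shape for the first suggested test is sketched below. It is an assumption-laden outline, not an existing FastDeploy test: `check_prefix_cache_consistency` and the `generate(prompt, enable_prefix_caching)` callable are hypothetical, and a real test would wrap an actual MTP-enabled engine with deterministic (greedy) decoding so that outputs are comparable across runs.

```python
from typing import Callable, List


def check_prefix_cache_consistency(
    generate: Callable[..., str],
    shared_prefix: str,
    suffixes: List[str],
) -> List[str]:
    """Return the prompts whose output differs once prefix caching is enabled.

    `generate` is a hypothetical callable taking (prompt, enable_prefix_caching)
    and returning generated text; it is assumed to be deterministic.
    """
    mismatches = []
    for suffix in suffixes:
        prompt = shared_prefix + suffix
        baseline = generate(prompt, enable_prefix_caching=False)
        cached = generate(prompt, enable_prefix_caching=True)
        if baseline != cached:
            mismatches.append(prompt)
    return mismatches


def fake_generate(prompt: str, enable_prefix_caching: bool) -> str:
    # Placeholder model: output depends only on the prompt, as a correct
    # engine's output should regardless of the caching flag.
    return prompt.upper()


# Several requests share one prefix, which is the case prefix caching targets.
result = check_prefix_cache_consistency(
    fake_generate, "System: be brief. ", ["Q1?", "Q2?", "Q3?"]
)
assert result == []
```

Comparing against a caching-disabled baseline keeps the check independent of any particular expected text, which may be convenient when the model or decoding settings change.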
self.tokenizer = self.model
if self.splitwise_role == "decode":
    self.enable_prefix_caching = False
if self.speculative_config is not None:
    self.enable_prefix_caching = False
if not current_platform.is_cuda() and not current_platform.is_xpu() and not current_platform.is_intel_hpu():
    self.enable_prefix_caching = False
Copilot
AI
Nov 30, 2025
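According to the custom code review guidelines, the PR description is incomplete. It only states "Enable prefix caching for mtp by default" and is missing the following key information:
- Why the change is being made: why was prefix caching previously disabled when speculative_config was set, and why can it be enabled now?
- How correctness was verified: were tests run, and what were the results?
- Potential impact: how does this change affect existing MTP users?
Suggested additions:
- Technical details showing that the MTP code (for example, fastdeploy/spec_decode/mtp.py line 180) already supports prefix caching
- Which tests were run to validate the change (accuracy, performance, and so on)
- Whether any benchmark data supports this change
This information is important for understanding and reviewing this functional change.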
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #5302 +/- ##
==========================================
Coverage ? 59.04%
==========================================
Files ? 322
Lines ? 39196
Branches ? 5888
==========================================
Hits ? 23143
Misses ? 14224
Partials ? 1829
Motivation
Enable prefix caching for mtp by default.
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR title tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.