Add Anthropic prompt caching support #3363
Conversation
Force-pushed from 791999d to 5b5cb9f
| """Add cache control to the last content block param.""" | ||
| if not params: | ||
| raise UserError( | ||
| 'CachePoint cannot be the first content in a user message - there must be previous content to attach the CachePoint to.' |
Copying in context from https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#what-can-be-cached:

- Tools: Tool definitions in the `tools` array
- System messages: Content blocks in the `system` array
- Text messages: Content blocks in the `messages.content` array, for both user and assistant turns
- Images & Documents: Content blocks in the `messages.content` array, in user turns
- Tool use and tool results: Content blocks in the `messages.content` array, in both user and assistant turns
I think we should support inserting a cache point after tool defs and system messages as well.
In the original PR I suggested doing this by supporting CachePoint as the first content in a user message (by adding it to whatever came before it: the system message, tool definition, or the last message of the assistant output), but that doesn't really feel natural from a code perspective.
What do you think about adding `anthropic_cache_tools` and `anthropic_cache_instructions` fields to `AnthropicModelSettings`, and setting `cache_control` on the relevant parts when set?
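A rough sketch of how those two settings could be applied when building the request (the field names come from the suggestion above; the helper, its signature, and the dict shapes are illustrative assumptions, not the actual implementation):

```python
from typing import Any


def apply_cache_settings(
    tools: list[dict[str, Any]],
    system: list[dict[str, Any]],
    model_settings: dict[str, Any],
) -> None:
    """Sketch: attach ephemeral cache_control according to the proposed settings."""
    cache_control = {'type': 'ephemeral'}  # i.e. a BetaCacheControlEphemeralParam
    if tools and model_settings.get('anthropic_cache_tools'):
        # Caches everything up to and including the last tool definition.
        tools[-1]['cache_control'] = cache_control
    if system and model_settings.get('anthropic_cache_instructions'):
        # Caches the system prompt content blocks.
        system[-1]['cache_control'] = cache_control
```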
Seems reasonable, I'll look into it!
Let's update the message here to make it clear that they are likely looking for one of the 2 settings instead.
@ronakrm If you're up for it, I'd welcome Bedrock support in this PR as well. It'll have that one bug (#2560 (comment)) but most users won't hit it, and it's clearly on their side to fix, not ours. Initially I thought we should hold off until they'd fixed it, but I'd rather just get this out for most people who won't hit the issue anyway.
I can take a stab at this, but I was a bit concerned about scope creep causing me to get less excited and delay work on this, and my current inability to test a live Bedrock example. I may first get a full pass on the pure-Anthropic side if that's alright with you. (Also not sure what your timelines are for this, but I should be able to make another pass at this in the next few days.)
Sounds good! I can then do Bedrock in a follow up PR.
That's great, thanks.
Force-pushed from b1f6d6c to 7ef071f
This implementation adds prompt caching support for Anthropic models, allowing users to cache parts of prompts (system prompts, long context, tools) to reduce costs by ~90% for cached tokens.

Key changes:
- Add CachePoint class to mark cache boundaries in prompts
- Implement cache control in AnthropicModel using BetaCacheControlEphemeralParam
- Add cache metrics mapping (cache_creation_input_tokens → cache_write_tokens)
- Add comprehensive tests for CachePoint functionality
- Add working example demonstrating prompt caching usage
- Add CachePoint filtering in OpenAI models for compatibility

The implementation is Anthropic-only (removed Bedrock complexity from original PR pydantic#2560) for a cleaner, more maintainable solution.

Related to pydantic#2560

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
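For orientation, the marker type itself can be very small; a minimal sketch of what such a `CachePoint` marker looks like (the field name and default are assumptions based on this description, not necessarily the exact class in the PR):

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class CachePoint:
    """Marker placed in a user prompt to indicate a cache boundary.

    Supported by Anthropic; other providers simply filter it out.
    """

    kind: Literal['cache-point'] = 'cache-point'
```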
- Fix TypedDict mutation in anthropic.py using cast()
- Handle CachePoint in otel message conversion (skip for telemetry)
- Add CachePoint handling in all model providers for compatibility
- Models without caching support (Bedrock, Gemini, Google, HuggingFace, OpenAI) now filter out CachePoint markers

All pyright type checks now pass.
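Filtering out the marker in providers without caching support amounts to skipping it while mapping user content; a sketch assuming `CachePoint` is importable from `pydantic_ai.messages` (the mapping function and dict shapes are illustrative, not the providers' real code):

```python
from pydantic_ai.messages import CachePoint


def map_user_content(items: list[object]) -> list[dict[str, object]]:
    """Illustrative only: providers without prompt caching just drop the marker."""
    mapped: list[dict[str, object]] = []
    for item in items:
        if isinstance(item, CachePoint):
            continue  # no cache_control equivalent in this provider's API
        if isinstance(item, str):
            mapped.append({'type': 'text', 'text': item})
        # ...other content types handled as before
    return mapped
```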
Adding CachePoint handling pushed method complexity over the limit (16 > 15). Added noqa: C901 to suppress the complexity warning.
- Add test_cache_point_in_otel_message_parts to cover CachePoint in otel conversion
- Add test_cache_control_unsupported_param_type to cover unsupported param error
- Use .get() for TypedDict access to avoid type checking errors
- Add type: ignore for testing protected method
- Restore pragma: lax no cover on google.py file_data handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add test_cache_point_filtering for OpenAI, Bedrock, Google, and Hugging Face
- Tests verify CachePoint is filtered out without errors
- Achieves 100% coverage for CachePoint code paths

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit addresses maintainer feedback on the Anthropic prompt caching PR:

- Add anthropic_cache_tools field to cache last tool definition
- Add anthropic_cache_instructions field to cache system prompts
- Rewrite existing CachePoint tests to use snapshot() assertions
- Add comprehensive tests for new caching settings
- Remove standalone example file, add docs section instead
- Move imports to top of test files
- Remove ineffective Google CachePoint test
- Add "Supported by: Anthropic" to CachePoint docstring
- Add Anthropic docs link in cache_control method

Tests are written but snapshots not yet generated (will be done in next commit).
Force-pushed from 7ef071f to 57d051a
So I think the current Python 3.11 + lowest-direct test failure is unrelated to this PR: it isn't caused by the CachePoint feature, but looks like a pre-existing Python 3.11 issue. Via CC debugging, it seems like:

I don't think the pure Python changes here can cause "illegal instruction" errors; this is probably always a C extension issue. The crash seems timing-dependent: main branch imports work, but this branch's slightly different timing exposes the C extension bug (affecting the parallel test execution via pytest-xdist worker processes). Unfortunately I can't reproduce this locally because I don't have CUDA, and --resolution lowest-direct tries to build vllm 0.1.3, which requires CUDA. Claude recommends:

@DouweM Any thoughts or recommendations?
- Add test_cache_point_with_streaming to verify CachePoint works with run_stream()
- Add test_cache_point_with_unsupported_type to verify error handling for non-cacheable content types
- Add test_cache_point_in_user_prompt to verify CachePoint is filtered in OpenTelemetry conversion
- Fix test_cache_point_filtering in test_google.py to properly test _map_user_prompt method
- Enhance test_cache_point_filtering in test_openai.py to directly test both Chat and Responses models
- Add test_cache_point_filtering_responses_model for OpenAI Responses API

These tests increase diff coverage from 68% to 98% (100% for all production code).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move CachePoint imports to top of test files (test_bedrock.py, test_huggingface.py)
- Add documentation link for cacheable_types in anthropic.py

Addresses feedback from @DouweM in PR pydantic#3363

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add explicit list[ModelMessage] type annotations in test_instrumented.py
- Fix pyright ignore comment placement in test_openai.py
- Remove unnecessary type ignore comments

Fixes CI pyright errors reported on Python 3.10

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
```python
# Add cache_control to the last tool if enabled
if tools and model_settings.get('anthropic_cache_tools'):
    last_tool = cast(dict[str, Any], tools[-1])
```
Why do we have to cast it? I'd rather change the type of BetaToolUnionParam to not be a union, so that we can be sure it's a (typed)dict here.
BetaToolUnionParam is an upstream anthropic package type, I think this is the best we can do for now?
(This cast was unneeded, but the one at ~L700 is, comment added)
| """Add cache control to the last content block param.""" | ||
| if not params: | ||
| raise UserError( | ||
| 'CachePoint cannot be the first content in a user message - there must be previous content to attach the CachePoint to.' |
Let's update the message here to make it clear that they are likely looking for one of the 2 settings instead.
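Something along these lines, pointing users at the two settings (the wording is a guess, using the setting names as they ended up after the later rename):

```python
from pydantic_ai.exceptions import UserError

raise UserError(
    'CachePoint cannot be the first content in a user message. To cache the system prompt '
    'or tool definitions, use the `anthropic_cache_instructions` or '
    '`anthropic_cache_tool_definitions` model settings instead.'
)
```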
```python
# Only certain types support cache_control
# See https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching#what-can-be-cached
cacheable_types = {'text', 'tool_use', 'server_tool_use', 'image', 'tool_result'}
last_param = cast(dict[str, Any], params[-1])  # Cast to dict for mutation
```
This didn't work without the cast?
Co-authored-by: Douwe Maan <me@douwe.me>
Co-authored-by: Douwe Maan <me@douwe.me>
- Rename anthropic_cache_tools to anthropic_cache_tool_definitions for clarity
- Add backticks to docstrings for code identifiers (cache_control, tools)
- Improve error message to mention alternative cache settings
- Remove unnecessary cast for BetaToolUnionParam (line 441)
- Add explanatory comment for necessary cast of BetaContentBlockParam (line 703)
- Update Bedrock comment to link to issue pydantic#3418

All tests pass with these changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
These mock tool functions are never actually called during tests, so their return statements don't need coverage. Achieves 100% coverage for test_anthropic.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add proper mkdocs cross-reference links to anthropic_cache_instructions
- Add proper mkdocs cross-reference links to anthropic_cache_tool_definitions
- Link to model settings documentation section

Per maintainer feedback on PR pydantic#3363.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace three separate sections (each with individual examples) with:

- A concise bulleted list of the three caching methods
- One comprehensive example showing all three methods combined

This reduces repetition and makes the documentation more scannable.

Per maintainer feedback on PR pydantic#3363.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
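Roughly what the settings-based half of that combined example looks like (a sketch based on the names in this PR; the model name and prompt text are placeholders), with `CachePoint` covering the third method inside message content:

```python
from pydantic_ai import Agent
from pydantic_ai.models.anthropic import AnthropicModelSettings

agent = Agent(
    'anthropic:claude-sonnet-4-5',  # placeholder model name
    system_prompt='Long, stable instructions that are worth caching...',
    model_settings=AnthropicModelSettings(
        anthropic_cache_instructions=True,      # adds cache_control to the system prompt
        anthropic_cache_tool_definitions=True,  # adds cache_control to the last tool definition
    ),
)
```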
@ronakrm Thanks Ronak!
Summary
This PR adds prompt caching support for Anthropic models, allowing users to cache parts of prompts (system prompts, long context, tools) to reduce costs by ~90% for cached tokens.
This is a simplified, Anthropic-only implementation based on the work in #2560, following the maintainer's suggestion to "launch this for just Anthropic first."
Core Implementation
- `CachePoint` class: simple marker that can be inserted into user prompts to indicate cache boundaries
- `AnthropicModel`: uses `BetaCacheControlEphemeralParam` to add `cache_control` to content blocks
- Usage reporting: `cache_write_tokens` and `cache_read_tokens` via `genai-prices`
- `CachePoint` is passed through for all other models (ignored)

Example Usage
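The collapsed example isn't preserved above; a minimal sketch of the idea (identifiers from this PR, model name and prompt text are placeholders):

```python
from pydantic_ai import Agent, CachePoint

agent = Agent('anthropic:claude-sonnet-4-5')  # placeholder model name

result = agent.run_sync(
    [
        'Large, reusable context goes here...',
        CachePoint(),  # everything before this marker becomes a cacheable prefix
        'The actual question',
    ]
)
print(result.output)
```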
Testing
Compatibility
Real-World Test Results
Tested with live Anthropic API:
Request 1 (cache write): cache_write_tokens=3264
Request 2 (cache read): cache_read_tokens=3264
Request 3 (cache read): cache_read_tokens=3264
Total savings: ~5875 token-equivalents (assuming cache reads are billed at roughly 10% of the base input-token rate, two cached reads save about 2 × 3264 × 0.9 ≈ 5875).
I likely can create a stacking PR to push system prompt caching for Anthropic as well (this needs to update `_map_message` and related code to just always have a list of blocks, and user-based string system prompts should probably just be detected and mapped into the JSON format).