feat(cache): add cache store abstraction layer #4166
Conversation
Implement a reusable cache store abstraction with in-memory and Redis backends as the foundation for the prompt caching feature (PR1 of progressive delivery).

- Add `CacheStore` protocol defining the cache interface
- Implement `MemoryCacheStore` with LRU, LFU, and TTL-only eviction policies
- Implement `RedisCacheStore` with connection pooling and retry logic
- Add `CircuitBreaker` for cache backend failure protection
- Include comprehensive unit tests (55 tests, >80% coverage)
- Add dependencies: `cachetools>=5.5.0`, `redis>=5.2.0`

This abstraction enables flexible caching implementations for the prompt caching middleware without coupling to specific storage backends.

Signed-off-by: William Caban <willliam.caban@gmail.com>
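As a point of reference, a hypothetical usage sketch of the abstraction described above; the class name and module path follow the PR description, but the constructor keywords (`max_entries`, `default_ttl`) and the `ttl=` argument to `set` are assumptions, not the PR's actual signatures:

```python
# Hypothetical usage of the cache store abstraction from this PR; the import
# path and class name follow the PR description, exact signatures are assumed.
import asyncio

from llama_stack.providers.utils.cache.memory import MemoryCacheStore


async def main() -> None:
    # Bounded entry count and a default TTL (values taken from the PR's stated defaults).
    cache = MemoryCacheStore(max_entries=1000, default_ttl=600)

    await cache.set("prompt:abc123", {"rendered": "You are a helpful AI assistant."}, ttl=600)
    if await cache.exists("prompt:abc123"):
        print(await cache.get("prompt:abc123"))
    await cache.delete("prompt:abc123")


asyncio.run(main())
```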
|
Can you describe the intention of the prompt caching here? The kind of prompt caching I'm aware of is typically implemented in the actual inference provider, such as vLLM's prefix caching, llm-d's prefix aware caching and routing, OpenAI's prompt caching, etc. What the above generally does is route requests to servers that are most likely to have the tokens generated from the given input already cached. Llama Stack isn't an inference server, and we don't deal with actual tokens. That would be the job of the inference server / service to handle the typical prompt caching scenario I outlined above. What is it you intend to store in this caching layer? |
|
++@bbrowning, this does not appear to be the kind of prompt caching we want in the Stack at all. |
|
The main intention is to implement classes to be used by a future PR for registering prompt templates (see the example below).
Example of final state:

```python
import mlflow
from llama_stack_client import LlamaStackClient

# Register prompt in MLflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.genai.register_prompt(
    name="helpful-assistant",
    template="You are a helpful AI assistant specialized in {{domain}}. "
             "Always provide accurate and concise answers.",
    tags={"category": "system-prompts"},
)

# Use prompt via Llama Stack
client = LlamaStackClient(base_url="http://localhost:8321")

# Get prompt from MLflow via Llama Stack
prompt = client.prompts.get(prompt_id="helpful-assistant")

# Use in chat completion - benefits from caching
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": prompt.prompt.replace("{{domain}}", "machine learning")},
        {"role": "user", "content": "Explain neural networks"},
    ],
)
```

In the case of the OpenAI API (https://platform.openai.com/docs/guides/prompting#create-a-prompt), prompt templates are registered through the OpenAI Playground and then the id is referenced and consumed through the regular messages. In addition, when used outside prompt templates, the optional prompt caching capability replicates OpenAI's Prompt Caching behavior (https://platform.openai.com/docs/guides/prompt-caching), allowing Llama Stack to serve as the prompt caching layer for inference providers that do not support it or only partially support it. (The intent of this is NOT a prefix-aware caching & routing mechanism.)
|
This PR implements Phase 1 of the prompt caching feature - automatic caching of prompt prefixes in OpenAI-compatible chat completion requests.

**Key Features:**
- Automatic caching of prompts ≥1024 tokens (configurable)
- SHA-256 cache key computation (FIPS-compliant)
- Multi-tenant isolation (tenant_id + user_id in cache keys)
- Circuit breaker pattern for graceful degradation
- Streaming request bypass (configurable)
- Token counting integration (PR2)
- Cache store abstraction integration (PR1)
- OpenAI response schema updates (PR3)

**Implementation:**
- src/llama_stack/core/server/prompt_caching.py
- tests/unit/server/test_prompt_caching.py
- 25 comprehensive unit tests (100% passing)
- >95% code coverage

**Dependencies:**
- Requires PR1 (cache-store-abstraction)
- Requires PR2 (tokenization-utilities)
- Requires PR3 (openai-response-schema)

**Test Results:**
- 25/25 unit tests passing
- All pre-commit checks passing (mypy, ruff, ruff-format)

Part of prompt caching implementation - Phase 1 of llamastack#4166

Signed-off-by: William Caban <william.caban@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
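For illustration, a minimal sketch of the cache-key scheme the feature list above describes (SHA-256 over tenant, user, model, and the prompt content, for multi-tenant isolation); the function name `compute_cache_key`, the exact fields, and the key prefix are assumptions rather than the PR's actual code:

```python
# Illustrative sketch of SHA-256 cache key computation with tenant/user isolation;
# names and key layout are hypothetical, not taken from the PR implementation.
import hashlib
import json


def compute_cache_key(tenant_id: str, user_id: str, model: str, messages: list[dict]) -> str:
    # Serialize deterministically so identical prompt prefixes hash to the same key.
    payload = json.dumps(
        {"tenant": tenant_id, "user": user_id, "model": model, "messages": messages},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "llama_stack:prompt:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = compute_cache_key(
    "tenant-a",
    "user-1",
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "system", "content": "You are a helpful AI assistant."}],
)
print(key)
```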
|
So, it feels like this is conflating two very different things. Prompt caching is not something Llama Stack generally can or should do, as Llama Stack plays no role in turning requests into tokens. Being able to reference prompts in Responses requests is something Llama Stack can do, and allowing those prompt references to be pulled from some place like MLflow as shown in your example above could be a useful feature. |
mattf left a comment
@williamcaban using mlflow to back prompt management makes sense -
```python
from openai import Client
import requests
import json

response = requests.request(
    "POST",
    "http://localhost:8321",
    data=json.dumps({
        "prompt": "You are a helpful AI assistant specialized in {{domain}}. Always provide accurate and concise answers.",
        "variables": ["domain"],
    }),
)
# [you can inspect mlflow to find the prompt]

client = Client(base_url="http://localhost:8321", api_key="none")
response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt={"id": response.json()["prompt_id"], "variables": {"domain": "machine learning"}},
    input="Explain neural networks",
)
```

if the mlflow prompt provider wants to cache the prompts in the stack server, that could be useful.
however, the prompt caching feature is about caching the prompt's tokens, which stack cannot utilize. the only api i'm aware of that accepts tokens as input is /v1/embeddings.
+1, I think a demo would be sufficient to show how MLflow and Llama Stack can play nicely together, particularly for optimizing and evaluating various parameters within the stack and being able to track the performance on some hold-out data of interest. You could then register (i.e., … |
|
As others have noted, this isn't the traditional prompt caching that most users consider. Since our stack's Prompt API supports multiple KVStore backends, you can already configure Redis as a DB for it in the … |
@williamcaban yes pls |
|
This pull request has merge conflicts that must be resolved before it can be merged. @williamcaban please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork |
|
Closing in favor of #4170 without caching options. |
OpenAI-Compatible Prompt Caching Feature - Phase 1 (cache store layer)
This PR implements the Cache Store Abstraction Layer for the OpenAI-compatible prompt caching feature. It provides a protocol-based abstraction for cache storage backends, enabling flexible implementations (memory, Redis) that will be used by the prompt caching middleware in an upcoming PR.
This is the first of several progressive PRs implementing prompt caching, as outlined in the following strategy.
Strategy
Implementation strategy for extending the Llama Stack OpenAI-compatible API to support Prompt Caching (as per OpenAI's implementation) while integrating with MLflow's prompt registry for prompt management and versioning.
- `cached_tokens` in usage statistics

Changes
Core Implementation (4 new files)
- `src/llama_stack/providers/utils/cache/cache_store.py` (~250 lines)
  - `CacheStore` protocol with async methods: `get`, `set`, `delete`, `exists`, `ttl`, `clear`, `size`
  - `CacheError` exception for graceful error handling
  - `CircuitBreaker` pattern for failure protection
- `src/llama_stack/providers/utils/cache/memory.py` (~370 lines)
  - Built on the `cachetools` library
  - `max_entries` (default: 1000)
  - `max_memory_mb` (default: 512MB, soft limit)
  - `default_ttl` (default: 600 seconds)
- `src/llama_stack/providers/utils/cache/redis.py` (~460 lines)
  - Key prefix (`llama_stack:`)
- `src/llama_stack/providers/utils/cache/__init__.py`

Tests (4 new test files, 55 test cases)
- `tests/unit/providers/utils/cache/test_cache_store.py` (17 tests)
  - `CacheError` exception handling
  - `CircuitBreaker` state transitions and recovery logic
- `tests/unit/providers/utils/cache/test_memory_cache.py` (18 tests)
- `tests/unit/providers/utils/cache/test_redis_cache.py` (20 tests)

Dependencies
- `pyproject.toml` (modified)
  - `cachetools>=5.5.0` - In-memory caching with LRU/LFU support
  - `redis>=5.2.0` - Redis client with async support

Testing
Unit Tests
Results:
Checklist
- `llama_stack.log`
- `pyproject.toml`

Architecture Notes
- Protocol-based design: Uses Python's `Protocol` for the cache store interface, allowing any implementation that satisfies the contract without requiring inheritance (see the sketch after this list)
- Circuit breaker pattern: Prevents cascade failures by temporarily disabling cache operations after repeated failures, with automatic recovery attempts
- Async-first API: All cache operations are async to support non-blocking I/O and integration with FastAPI middleware
- Separation of concerns: the interface (`cache_store.py`) is kept separate from the implementations (`memory.py`, `redis.py`)
- Graceful degradation: Cache failures never block requests - errors are logged but handled gracefully
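For orientation, a minimal sketch of what the protocol-based design and circuit breaker described above could look like; the method set follows the PR description, but the bodies, defaults, and exact signatures below are assumptions rather than the PR's actual implementation:

```python
# Sketch of a Protocol-based cache store interface plus a circuit breaker,
# assuming the method set listed in this PR (get, set, delete, exists, ttl,
# clear, size); details are illustrative only.
import time
from typing import Any, Protocol


class CacheStore(Protocol):
    async def get(self, key: str) -> Any | None: ...
    async def set(self, key: str, value: Any, ttl: int | None = None) -> None: ...
    async def delete(self, key: str) -> None: ...
    async def exists(self, key: str) -> bool: ...
    async def ttl(self, key: str) -> int | None: ...
    async def clear(self) -> None: ...
    async def size(self) -> int: ...


class CircuitBreaker:
    """Disables cache calls after repeated failures, then retries after a cooldown."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: allow a trial request once the cooldown has elapsed.
        return time.monotonic() - self._opened_at >= self.recovery_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()
```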
Security Considerations