Conversation

@williamcaban commented Nov 15, 2025

OpenAI-Compatible Prompt Caching Feature - Phase 1 (cache store layer)

This PR implements the Cache Store Abstraction Layer for the OpenAI-compatible prompt caching feature. It provides a protocol-based abstraction for cache storage backends, enabling flexible implementations (memory, Redis) that will be used by the prompt caching middleware in an upcoming PR.

This is the first in a series of progressive PRs implementing prompt caching, as outlined in the strategy below.

Strategy

The strategy is to extend the Llama Stack OpenAI-compatible API to support prompt caching (as per OpenAI's implementation) while integrating with MLflow's prompt registry for prompt management and versioning:

  1. Enable OpenAI-style Prompt Caching: Automatically cache prompt prefixes longer than 1,024 tokens (configurable) to reduce latency and costs
  2. Integrate MLflow Prompt Registry: Use MLflow as an external provider for prompt storage, versioning, and management
  3. Maintain OpenAI API Compatibility: Ensure compatibility with OpenAI's response format, including cached_tokens in usage statistics (see the usage example after this list)
  4. Provider-agnostic Design: Support caching across multiple inference providers (OpenAI, Anthropic, Together, Ollama, etc.)
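
For point 3, the target shape is OpenAI's usage block, which reports reused prefix tokens under prompt_tokens_details.cached_tokens. A minimal illustration (the values are made up; only the field names matter here):

usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    # tokens served from the cached prompt prefix
    "prompt_tokens_details": {"cached_tokens": 1920},
}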

Changes

Core Implementation (4 new files)

src/llama_stack/providers/utils/cache/cache_store.py (~250 lines)

  • Define CacheStore protocol with async methods: get, set, delete, exists, ttl, clear, size (see the sketch after this list)
  • Add CacheError exception for graceful error handling
  • Implement CircuitBreaker pattern for failure protection:
    • Configurable failure threshold (default: 10 failures)
    • Automatic recovery timeout (default: 60 seconds)
    • Three states: CLOSED, OPEN, HALF_OPEN
    • Prevents cascade failures when cache backend is unavailable
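
For orientation, a minimal sketch of the shapes described above (abbreviated; exact signatures and the breaker's method names are assumptions, not the literal contents of cache_store.py):

# Sketch only: protocol and circuit breaker shapes as described above.
import time
from enum import Enum
from typing import Any, Protocol


class CacheError(Exception):
    """Raised when a cache backend operation fails."""


class CacheStore(Protocol):
    async def get(self, key: str) -> Any | None: ...
    async def set(self, key: str, value: Any, ttl: int | None = None) -> None: ...
    async def delete(self, key: str) -> bool: ...
    async def exists(self, key: str) -> bool: ...
    async def ttl(self, key: str) -> int | None: ...
    async def clear(self) -> None: ...
    async def size(self) -> int: ...


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # too many failures; skip cache calls
    HALF_OPEN = "half_open"  # probing whether the backend has recovered


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        # After the recovery timeout, let one probe through (HALF_OPEN).
        if self.state is CircuitState.OPEN and time.monotonic() - self._opened_at >= self.recovery_timeout:
            self.state = CircuitState.HALF_OPEN
        return self.state is not CircuitState.OPEN

    def record_success(self) -> None:
        self._failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self._opened_at = time.monotonic()

The HALF_OPEN probe is what provides the automatic recovery attempt after the 60-second timeout.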

src/llama_stack/providers/utils/cache/memory.py (~370 lines)

  • In-memory cache store using cachetools library
  • Support for multiple eviction policies (see the mapping sketch after this list):
    • LRU (Least Recently Used) - default
    • LFU (Least Frequently Used)
    • TTL-only (Time-based expiration)
  • Configurable limits:
    • max_entries (default: 1000)
    • max_memory_mb (default: 512MB, soft limit)
    • default_ttl (default: 600 seconds)
  • Thread-safe for concurrent access
  • Automatic expired entry cleanup
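
A minimal sketch of how the eviction policies can map onto cachetools containers (illustrative only; the helper name is hypothetical, and the real MemoryCacheStore wraps the chosen container behind the async CacheStore protocol with thread-safety and memory accounting):

from cachetools import LFUCache, LRUCache, TTLCache


def build_backing_cache(policy: str = "lru", max_entries: int = 1000, default_ttl: int = 600):
    # Hypothetical helper: pick the cachetools container for the configured policy.
    if policy == "lru":
        return LRUCache(maxsize=max_entries)
    if policy == "lfu":
        return LFUCache(maxsize=max_entries)
    if policy == "ttl":
        return TTLCache(maxsize=max_entries, ttl=default_ttl)
    raise ValueError(f"Failed to build cache: unknown eviction policy '{policy}'")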

src/llama_stack/providers/utils/cache/redis.py (~460 lines)

  • Production-ready Redis cache store
  • Connection pooling for efficient resource usage:
    • Configurable pool size (default: 10 connections)
    • Lazy connection initialization
    • Proper cleanup on close
  • Retry logic with exponential backoff (see the sketch after this list):
    • Max retries: 3 (configurable)
    • Backoff schedule: 100ms, 200ms, 400ms
    • Only retries transient failures (connection, timeout)
  • JSON serialization for complex data types
  • Configurable timeouts (default: 100ms)
  • Key namespacing with configurable prefix (default: llama_stack:)
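
A condensed sketch of the pooling and retry/backoff behavior described above, using redis.asyncio (class, attribute, and method names here are illustrative, not the PR's exact code):

import asyncio
import json

from redis import asyncio as aioredis
from redis.exceptions import ConnectionError as RedisConnectionError
from redis.exceptions import TimeoutError as RedisTimeoutError


class RedisCacheSketch:
    def __init__(self, url: str = "redis://localhost:6379", prefix: str = "llama_stack:",
                 max_retries: int = 3, base_backoff: float = 0.1, max_connections: int = 10):
        # Pooled client; connections are established lazily on first use.
        self._pool = aioredis.ConnectionPool.from_url(url, max_connections=max_connections)
        self._client = aioredis.Redis(connection_pool=self._pool)
        self._prefix = prefix
        self._max_retries = max_retries
        self._base_backoff = base_backoff

    async def get(self, key: str):
        raw = await self._with_retries(self._client.get, self._prefix + key)
        return json.loads(raw) if raw is not None else None

    async def set(self, key: str, value, ttl: int | None = None) -> None:
        # JSON serialization so complex values round-trip through Redis.
        await self._with_retries(self._client.set, self._prefix + key, json.dumps(value), ex=ttl)

    async def _with_retries(self, op, *args, **kwargs):
        # Only transient failures are retried; the defaults give a 100ms, 200ms, 400ms backoff schedule.
        for attempt in range(self._max_retries + 1):
            try:
                return await op(*args, **kwargs)
            except (RedisConnectionError, RedisTimeoutError):
                if attempt == self._max_retries:
                    raise
                await asyncio.sleep(self._base_backoff * (2 ** attempt))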

src/llama_stack/providers/utils/cache/__init__.py

  • Export all public classes for easy importing

Tests (4 new test files, 55 test cases)

tests/unit/providers/utils/cache/test_cache_store.py (17 tests)

  • Test CacheError exception handling
  • Test CircuitBreaker state transitions and recovery logic
  • Test concurrent operations and failure scenarios

tests/unit/providers/utils/cache/test_memory_cache.py (18 tests)

  • Test all cache operations (get, set, delete, exists, ttl, clear, size)
  • Test eviction policies (LRU, LFU)
  • Test TTL expiration and cleanup
  • Test concurrent access
  • Test edge cases (invalid params, expired keys)

tests/unit/providers/utils/cache/test_redis_cache.py (20 tests)

  • Test Redis operations with mocked client
  • Test retry logic and failure handling
  • Test connection management
  • Test serialization/deserialization
  • Test configuration validation

Dependencies

pyproject.toml (modified)

  • Added cachetools>=5.5.0 - In-memory caching with LRU/LFU support
  • Added redis>=5.2.0 - Redis client with async support

Testing

Unit Tests

uv run --group dev pytest -sv tests/unit/providers/utils/cache/

Results:

  • ✅ 55 tests passed
  • ✅ >80% line coverage, >70% branch coverage
  • ✅ All eviction policies tested (LRU, LFU, TTL-only)
  • ✅ Circuit breaker state transitions verified
  • ✅ Retry logic with exponential backoff verified
  • ✅ Concurrent access scenarios tested

Checklist

  • All unit tests pass (55/55)
  • Pre-commit hooks pass (except Bash 4.0 script - system limitation)
  • Code coverage >80% line, >70% branch
  • Type checking passes (mypy)
  • Documentation updated (docstrings for all public methods)
  • Follows code style guidelines:
    • FIPS compliance (SHA-256, no MD5/SHA1)
    • Custom logging via llama_stack.log
    • Error messages prefixed with "Failed to ..."
    • Type hints for all public methods
    • Keyword arguments when calling functions
    • ASCII-only (no Unicode)
    • Meaningful comments
  • No breaking changes
  • Dependencies added to pyproject.toml

Architecture Notes

  1. Protocol-based design: Uses Python's Protocol for the cache store interface, allowing any implementation that satisfies the contract without requiring inheritance

  2. Circuit breaker pattern: Prevents cascade failures by temporarily disabling cache operations after repeated failures, with automatic recovery attempts

  3. Async-first API: All cache operations are async to support non-blocking I/O and integration with FastAPI middleware

  4. Separation of concerns:

    • Protocol defines interface (cache_store.py)
    • Implementations are independent (memory.py, redis.py)
    • Circuit breaker is reusable utility
  5. Graceful degradation: Cache failures never block requests; errors are logged and the request proceeds without the cache (see the sketch below)
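
In the (future) middleware, point 5 amounts to a small wrapper along these lines (a sketch that assumes the exports from this PR's __init__.py and the illustrative breaker methods sketched earlier):

import logging

from llama_stack.providers.utils.cache import CacheError  # exported by the new __init__.py

logger = logging.getLogger(__name__)  # the PR itself uses the custom llama_stack.log logger


async def cached_get(store, breaker, key: str):
    """Return a cached value if possible; never let a cache failure block the request."""
    if not breaker.allow():
        return None  # circuit open: skip the cache entirely
    try:
        value = await store.get(key)
        breaker.record_success()
        return value
    except CacheError:
        breaker.record_failure()
        logger.warning("Failed to read from cache; continuing without it", exc_info=True)
        return None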

Security Considerations

  • FIPS compliant - Uses SHA-256 for any hashing (in circuit breaker state tracking)
  • No credentials in logs - Redis password not logged
  • Key namespacing - Redis cache uses configurable prefix to prevent key collisions (illustrated in the sketch below)
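
As an illustration of the hashing and namespacing points only (hypothetical helper; the actual cache-key computation, including tenant_id/user_id isolation, lands in the later middleware PR):

import hashlib


def make_cache_key(prompt_prefix: str, tenant_id: str, user_id: str, prefix: str = "llama_stack:") -> str:
    # SHA-256 keeps the derivation FIPS-friendly; the prefix namespaces keys in Redis.
    digest = hashlib.sha256(f"{tenant_id}:{user_id}:{prompt_prefix}".encode("utf-8")).hexdigest()
    return f"{prefix}prompt:{digest}"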

Implement reusable cache store abstraction with in-memory and Redis
backends as foundation for prompt caching feature (PR1 of progressive delivery).

- Add CacheStore protocol defining cache interface
- Implement MemoryCacheStore with LRU, LFU, and TTL-only eviction policies
- Implement RedisCacheStore with connection pooling and retry logic
- Add CircuitBreaker for cache backend failure protection
- Include comprehensive unit tests (55 tests, >80% coverage)
- Add dependencies: cachetools>=5.5.0, redis>=5.2.0

This abstraction enables flexible caching implementations for the prompt
caching middleware without coupling to specific storage backends.

Signed-off-by: William Caban <william.caban@gmail.com>
@bbrowning
Collaborator

Can you describe the intention of the prompt caching here? The kind of prompt caching I'm aware of is typically implemented in the actual inference provider, such as vLLM's prefix caching, llm-d's prefix aware caching and routing, OpenAI's prompt caching, etc. What the above generally does is route requests to servers that are most likely to have the tokens generated from the given input already cached.

Llama Stack isn't an inference server, and we don't deal with actual tokens. That would be the job of the inference server / service to handle the typical prompt caching scenario I outlined above. What is it you intend to store in this caching layer?

@ashwinb
Contributor

ashwinb commented Nov 16, 2025

++@bbrowning, this does not appear to be the kind of prompt caching we want in the Stack at all.

@williamcaban
Author

The main intention is to implement the classes that a future PR will use for registering prompt templates (see the example below).

| Component | Type | Usage |
|---|---|---|
| CacheStore | Protocol | Type hints, interface definition |
| MemoryCacheStore | Class (inline) | Development/single-node cache |
| RedisCacheStore | Class (remote) | Production/distributed cache |
| CircuitBreaker | Class | Failure protection |
| CacheError | Exception | Error handling |

Example of final state:

import mlflow
from llama_stack_client import LlamaStackClient

# Register prompt in MLflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.genai.register_prompt(
    name="helpful-assistant",
    template="You are a helpful AI assistant specialized in {{domain}}. "
             "Always provide accurate and concise answers.",
    tags={"category": "system-prompts"}
)

# Use prompt via Llama Stack
client = LlamaStackClient(base_url="http://localhost:8321")

# Get prompt from MLflow via Llama Stack
prompt = client.prompts.get(prompt_id="helpful-assistant")

# Use in chat completion - benefits from caching
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": prompt.prompt.replace("{{domain}}", "machine learning")},
        {"role": "user", "content": "Explain neural networks"}
    ]
)

In the case of the OpenAI API (https://platform.openai.com/docs/guides/prompting#create-a-prompt), prompt templates are registered through the OpenAI playground and then the id is referenced and consumed through the regular messages.

In addition, when used outside prompt templates, the optional prompt caching capability replicates OpenAI Prompt Caching behavior (https://platform.openai.com/docs/guides/prompt-caching), allowing Llama Stack to serve as the prompt caching layer for inference providers that do not support it or only partially support it. (The intent of this is NOT a prefix-aware caching & routing mechanism.)

| Provider | Native Caching | OpenAI Format | Llama Stack Cache | Notes |
|---|---|---|---|---|
| OpenAI | Yes | Yes | Yes | Automatic prefix caching |
| Together | No | Yes | Yes | Manual caching via middleware |
| Anthropic | Yes | Partial | Yes | Uses different cache control API |
| Ollama | No | Yes | Yes | Manual caching via middleware |
| Bedrock | No | Yes | Yes | Manual caching via middleware |
| vLLM | Partial | Yes | Yes | Supports prefix caching in engine |
| Fireworks | No | Yes | Yes | Manual caching via middleware |

Key:

  • Full support - Works as documented
  • Partial support - Limited or different implementation
  • No native support - Llama Stack provides caching layer

williamcaban added a commit to williamcaban/llama-stack that referenced this pull request Nov 16, 2025
This PR implements Phase 1 of the prompt caching feature - automatic
caching of prompt prefixes in OpenAI-compatible chat completion requests.

**Key Features:**
- Automatic caching of prompts ≥1024 tokens (configurable)
- SHA-256 cache key computation (FIPS-compliant)
- Multi-tenant isolation (tenant_id + user_id in cache keys)
- Circuit breaker pattern for graceful degradation
- Streaming request bypass (configurable)
- Token counting integration (PR2)
- Cache store abstraction integration (PR1)
- OpenAI response schema updates (PR3)

**Implementation:**
- src/llama_stack/core/server/prompt_caching.py
- tests/unit/server/test_prompt_caching.py
- 25 comprehensive unit tests (100% passing)
- >95% code coverage

**Dependencies:**
- Requires PR1 (cache-store-abstraction)
- Requires PR2 (tokenization-utilities)
- Requires PR3 (openai-response-schema)

**Test Results:**
- 25/25 unit tests passing
- All pre-commit checks passing (mypy, ruff, ruff-format)

Part of prompt caching implementation - Phase 1 of llamastack#4166

Signed-off-by: William Caban <william.caban@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
@bbrowning
Collaborator

So, it feels like this is conflating two very different things.

Prompt caching is not something Llama Stack generally can or should do, as Llama Stack plays no role in turning requests into tokens.

Being able to reference prompts in Responses requests is something Llama Stack can do, and allowing those prompt references to be pulled from some place like MLflow as shown in your example above could be a useful feature.

@mattf
Collaborator

@mattf left a comment

@williamcaban using MLflow to back prompt management makes sense -

from openai import Client
import requests

# register the prompt template (you can then inspect mlflow to find it)
response = requests.post(
    "http://localhost:8321",
    json={
        "prompt": "You are a helpful AI assistant specialized in {{domain}}. Always provide accurate and concise answers.",
        "variables": ["domain"],
    },
)
prompt_id = response.json()["prompt_id"]

client = Client(base_url="http://localhost:8321", api_key="none")

response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt={"id": prompt_id, "variables": {"domain": "machine learning"}},
    input="Explain neural networks",
)

If the MLflow prompt provider wants to cache the prompts in the stack server, that could be useful.

However, the prompt caching feature is about caching the prompt's tokens, which the stack cannot utilize. The only API I'm aware of that accepts tokens as input is /v1/embeddings.

@franciscojavierarceo
Collaborator

So, it feels like this is conflating two very different things.

Prompt caching is not something Llama Stack generally can or should do, as Llama Stack plays no role in turning requests into tokens.

Being able to reference prompts in Responses requests is something Llama Stack can do, and allowing those prompt references to be pulled from some place like MLflow as shown in your example above could be a useful feature.

+1, I think a demo would be sufficient to show how MLflow and Llama Stack can play nicely together, particularly for optimizing and evaluating various parameters within the stack and tracking performance on some held-out data of interest. You could then register (i.e., POST to /v1/prompts) the prompt with the best expected performance on that held-out set.

As others have noted, this isn't the traditional prompt caching that most users consider. Since our stack's Prompt API supports multiple KVStore backends, you can already configure Redis as a DB for it in run.yaml, though I have not tried this myself.

@williamcaban
Author

I'm removing this PR for the cache store and the tokenization PR #4168. Would reworking #4170 to bring in only a remote prompt registry be of interest?

@mattf
Collaborator

mattf commented Nov 18, 2025

I'm removing this PR for the cache store and the tokenization PR #4168. Would reworking #4170 to bring in only a remote prompt registry be of interest?

@williamcaban yes pls

@mergify

mergify bot commented Nov 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @williamcaban please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 18, 2025
@williamcaban
Author

Closing in favor of #4170 without caching options.
