Conversation

@williamcaban commented Nov 15, 2025

OpenAI-Compatible Prompt Caching Feature - Phase 1 (cache store layer)

This PR implements the Cache Store Abstraction Layer for the OpenAI-compatible prompt caching feature. It provides a protocol-based abstraction for cache storage backends, enabling flexible implementations (memory, Redis) that will be used by the prompt caching middleware in an upcoming PR.

This is the first in a series of progressive PRs implementing prompt caching, as outlined in the strategy below.

Strategy

The strategy is to extend the Llama Stack OpenAI-compatible API to support prompt caching (as per OpenAI's implementation) while integrating with MLflow's prompt registry for prompt management and versioning:

  1. Enable OpenAI-style Prompt Caching: Automatically cache prompt prefixes longer than 1,024 tokens (configurable) to reduce latency and costs
  2. Integrate MLflow Prompt Registry: Use MLflow as an external provider for prompt storage, versioning, and management
  3. Maintain OpenAI API Compatibility: Ensure compatibility with OpenAI's response format, including cached_tokens in usage statistics (see the usage example after this list)
  4. Provider-agnostic Design: Support caching across multiple inference providers (OpenAI, Anthropic, Together, Ollama, etc.)
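
For point 3, the target shape is OpenAI's usage block, which reports reused prefix tokens under prompt_tokens_details.cached_tokens. A minimal illustration (the values are made up; only the field names matter here):

usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "total_tokens": 2306,
    # tokens served from the cached prompt prefix
    "prompt_tokens_details": {"cached_tokens": 1920},
}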

Changes

Core Implementation (4 new files)

src/llama_stack/providers/utils/cache/cache_store.py (~250 lines)

  • Define CacheStore protocol with async methods: get, set, delete, exists, ttl, clear, size (see the sketch after this list)
  • Add CacheError exception for graceful error handling
  • Implement CircuitBreaker pattern for failure protection:
    • Configurable failure threshold (default: 10 failures)
    • Automatic recovery timeout (default: 60 seconds)
    • Three states: CLOSED, OPEN, HALF_OPEN
    • Prevents cascade failures when cache backend is unavailable
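
For orientation, a minimal sketch of the shapes described above (abbreviated; exact signatures and the breaker's method names are assumptions, not the literal contents of cache_store.py):

# Sketch only: protocol and circuit breaker shapes as described above.
import time
from enum import Enum
from typing import Any, Protocol


class CacheError(Exception):
    """Raised when a cache backend operation fails."""


class CacheStore(Protocol):
    async def get(self, key: str) -> Any | None: ...
    async def set(self, key: str, value: Any, ttl: int | None = None) -> None: ...
    async def delete(self, key: str) -> bool: ...
    async def exists(self, key: str) -> bool: ...
    async def ttl(self, key: str) -> int | None: ...
    async def clear(self) -> None: ...
    async def size(self) -> int: ...


class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # too many failures; skip cache calls
    HALF_OPEN = "half_open"  # probing whether the backend has recovered


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        # After the recovery timeout, let one probe through (HALF_OPEN).
        if self.state is CircuitState.OPEN and time.monotonic() - self._opened_at >= self.recovery_timeout:
            self.state = CircuitState.HALF_OPEN
        return self.state is not CircuitState.OPEN

    def record_success(self) -> None:
        self._failures = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self._opened_at = time.monotonic()

The HALF_OPEN probe is what provides the automatic recovery attempt after the 60-second timeout.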

src/llama_stack/providers/utils/cache/memory.py (~370 lines)

  • In-memory cache store using cachetools library
  • Support for multiple eviction policies (see the mapping sketch after this list):
    • LRU (Least Recently Used) - default
    • LFU (Least Frequently Used)
    • TTL-only (Time-based expiration)
  • Configurable limits:
    • max_entries (default: 1000)
    • max_memory_mb (default: 512MB, soft limit)
    • default_ttl (default: 600 seconds)
  • Thread-safe for concurrent access
  • Automatic expired entry cleanup
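
A minimal sketch of how the eviction policies can map onto cachetools containers (illustrative only; the helper name is hypothetical, and the real MemoryCacheStore wraps the chosen container behind the async CacheStore protocol with thread-safety and memory accounting):

from cachetools import LFUCache, LRUCache, TTLCache


def build_backing_cache(policy: str = "lru", max_entries: int = 1000, default_ttl: int = 600):
    # Hypothetical helper: pick the cachetools container for the configured policy.
    if policy == "lru":
        return LRUCache(maxsize=max_entries)
    if policy == "lfu":
        return LFUCache(maxsize=max_entries)
    if policy == "ttl":
        return TTLCache(maxsize=max_entries, ttl=default_ttl)
    raise ValueError(f"Failed to build cache: unknown eviction policy '{policy}'")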

src/llama_stack/providers/utils/cache/redis.py (~460 lines)

  • Production-ready Redis cache store
  • Connection pooling for efficient resource usage:
    • Configurable pool size (default: 10 connections)
    • Lazy connection initialization
    • Proper cleanup on close
  • Retry logic with exponential backoff (see the sketch after this list):
    • Max retries: 3 (configurable)
    • Backoff schedule: 100ms, 200ms, 400ms
    • Only retries transient failures (connection, timeout)
  • JSON serialization for complex data types
  • Configurable timeouts (default: 100ms)
  • Key namespacing with configurable prefix (default: llama_stack:)
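
A condensed sketch of the pooling and retry/backoff behavior described above, using redis.asyncio (class, attribute, and method names here are illustrative, not the PR's exact code):

import asyncio
import json

from redis import asyncio as aioredis
from redis.exceptions import ConnectionError as RedisConnectionError
from redis.exceptions import TimeoutError as RedisTimeoutError


class RedisCacheSketch:
    def __init__(self, url: str = "redis://localhost:6379", prefix: str = "llama_stack:",
                 max_retries: int = 3, base_backoff: float = 0.1, max_connections: int = 10):
        # Pooled client; connections are established lazily on first use.
        self._pool = aioredis.ConnectionPool.from_url(url, max_connections=max_connections)
        self._client = aioredis.Redis(connection_pool=self._pool)
        self._prefix = prefix
        self._max_retries = max_retries
        self._base_backoff = base_backoff

    async def get(self, key: str):
        raw = await self._with_retries(self._client.get, self._prefix + key)
        return json.loads(raw) if raw is not None else None

    async def set(self, key: str, value, ttl: int | None = None) -> None:
        # JSON serialization so complex values round-trip through Redis.
        await self._with_retries(self._client.set, self._prefix + key, json.dumps(value), ex=ttl)

    async def _with_retries(self, op, *args, **kwargs):
        # Only transient failures are retried; the defaults give a 100ms, 200ms, 400ms backoff schedule.
        for attempt in range(self._max_retries + 1):
            try:
                return await op(*args, **kwargs)
            except (RedisConnectionError, RedisTimeoutError):
                if attempt == self._max_retries:
                    raise
                await asyncio.sleep(self._base_backoff * (2 ** attempt))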

src/llama_stack/providers/utils/cache/__init__.py

  • Export all public classes for easy importing

Tests (4 new test files, 55 test cases)

tests/unit/providers/utils/cache/test_cache_store.py (17 tests)

  • Test CacheError exception handling
  • Test CircuitBreaker state transitions and recovery logic
  • Test concurrent operations and failure scenarios

tests/unit/providers/utils/cache/test_memory_cache.py (18 tests)

  • Test all cache operations (get, set, delete, exists, ttl, clear, size)
  • Test eviction policies (LRU, LFU)
  • Test TTL expiration and cleanup
  • Test concurrent access
  • Test edge cases (invalid params, expired keys)

tests/unit/providers/utils/cache/test_redis_cache.py (20 tests)

  • Test Redis operations with mocked client
  • Test retry logic and failure handling
  • Test connection management
  • Test serialization/deserialization
  • Test configuration validation

Dependencies

pyproject.toml (modified)

  • Added cachetools>=5.5.0 - In-memory caching with LRU/LFU support
  • Added redis>=5.2.0 - Redis client with async support

Testing

Unit Tests

uv run --group dev pytest -sv tests/unit/providers/utils/cache/

Results:

  • ✅ 55 tests passed
  • ✅ >80% line coverage, >70% branch coverage
  • ✅ All eviction policies tested (LRU, LFU, TTL-only)
  • ✅ Circuit breaker state transitions verified
  • ✅ Retry logic with exponential backoff verified
  • ✅ Concurrent access scenarios tested

Checklist

  • All unit tests pass (55/55)
  • Pre-commit hooks pass (except Bash 4.0 script - system limitation)
  • Code coverage >80% line, >70% branch
  • Type checking passes (mypy)
  • Documentation updated (docstrings for all public methods)
  • Follows code style guidelines:
    • FIPS compliance (SHA-256, no MD5/SHA1)
    • Custom logging via llama_stack.log
    • Error messages prefixed with "Failed to ..."
    • Type hints for all public methods
    • Keyword arguments when calling functions
    • ASCII-only (no Unicode)
    • Meaningful comments
  • No breaking changes
  • Dependencies added to pyproject.toml

Architecture Notes

  1. Protocol-based design: Uses Python's Protocol for the cache store interface, allowing any implementation that satisfies the contract without requiring inheritance

  2. Circuit breaker pattern: Prevents cascade failures by temporarily disabling cache operations after repeated failures, with automatic recovery attempts

  3. Async-first API: All cache operations are async to support non-blocking I/O and integration with FastAPI middleware

  4. Separation of concerns:

    • Protocol defines interface (cache_store.py)
    • Implementations are independent (memory.py, redis.py)
    • Circuit breaker is reusable utility
  5. Graceful degradation: Cache failures never block requests; errors are logged and the request proceeds without the cache (see the sketch below)
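
In the (future) middleware, point 5 amounts to a small wrapper along these lines (a sketch that assumes the exports from this PR's __init__.py and the illustrative breaker methods sketched earlier):

import logging

from llama_stack.providers.utils.cache import CacheError  # exported by the new __init__.py

logger = logging.getLogger(__name__)  # the PR itself uses the custom llama_stack.log logger


async def cached_get(store, breaker, key: str):
    """Return a cached value if possible; never let a cache failure block the request."""
    if not breaker.allow():
        return None  # circuit open: skip the cache entirely
    try:
        value = await store.get(key)
        breaker.record_success()
        return value
    except CacheError:
        breaker.record_failure()
        logger.warning("Failed to read from cache; continuing without it", exc_info=True)
        return None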

Security Considerations

  • FIPS compliant - Uses SHA-256 for any hashing (in circuit breaker state tracking)
  • No credentials in logs - Redis password not logged
  • Key namespacing - Redis cache uses configurable prefix to prevent key collisions (illustrated in the sketch below)
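
As an illustration of the hashing and namespacing points only (hypothetical helper; the actual cache-key computation, including tenant_id/user_id isolation, lands in the later middleware PR):

import hashlib


def make_cache_key(prompt_prefix: str, tenant_id: str, user_id: str, prefix: str = "llama_stack:") -> str:
    # SHA-256 keeps the derivation FIPS-friendly; the prefix namespaces keys in Redis.
    digest = hashlib.sha256(f"{tenant_id}:{user_id}:{prompt_prefix}".encode("utf-8")).hexdigest()
    return f"{prefix}prompt:{digest}"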

Implement reusable cache store abstraction with in-memory and Redis
backends as foundation for prompt caching feature (PR1 of progressive delivery).

- Add CacheStore protocol defining cache interface
- Implement MemoryCacheStore with LRU, LFU, and TTL-only eviction policies
- Implement RedisCacheStore with connection pooling and retry logic
- Add CircuitBreaker for cache backend failure protection
- Include comprehensive unit tests (55 tests, >80% coverage)
- Add dependencies: cachetools>=5.5.0, redis>=5.2.0

This abstraction enables flexible caching implementations for the prompt
caching middleware without coupling to specific storage backends.

Signed-off-by: William Caban <william.caban@gmail.com>
@bbrowning
Collaborator

Can you describe the intention of the prompt caching here? The kind of prompt caching I'm aware of is typically implemented in the actual inference provider, such as vLLM's prefix caching, llm-d's prefix aware caching and routing, OpenAI's prompt caching, etc. What the above generally does is route requests to servers that are most likely to have the tokens generated from the given input already cached.

Llama Stack isn't an inference server, and we don't deal with actual tokens. That would be the job of the inference server / service to handle the typical prompt caching scenario I outlined above. What is it you intend to store in this caching layer?

@ashwinb
Contributor

ashwinb commented Nov 16, 2025

++@bbrowning, this does not appear to be the kind of prompt caching we want in the Stack at all.

@williamcaban
Author

The main intention is to implement the classes that a future PR will use for registering prompt templates (see the example below).

| Component | Type | Usage |
|---|---|---|
| CacheStore | Protocol | Type hints, interface definition |
| MemoryCacheStore | Class (inline) | Development/single-node cache |
| RedisCacheStore | Class (remote) | Production/distributed cache |
| CircuitBreaker | Class | Failure protection |
| CacheError | Exception | Error handling |

Example of final state:

import mlflow
from llama_stack_client import LlamaStackClient

# Register prompt in MLflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.genai.register_prompt(
    name="helpful-assistant",
    template="You are a helpful AI assistant specialized in {{domain}}. "
             "Always provide accurate and concise answers.",
    tags={"category": "system-prompts"}
)

# Use prompt via Llama Stack
client = LlamaStackClient(base_url="http://localhost:8321")

# Get prompt from MLflow via Llama Stack
prompt = client.prompts.get(prompt_id="helpful-assistant")

# Use in chat completion - benefits from caching
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": prompt.prompt.replace("{{domain}}", "machine learning")},
        {"role": "user", "content": "Explain neural networks"}
    ]
)

In the case of the OpenAI API (https://platform.openai.com/docs/guides/prompting#create-a-prompt), prompt templates are registered through the OpenAI playground and then the id is referenced and consumed through the regular messages.

In addition, when used outside prompt templates, the optional prompt caching capability replicates OpenAI Prompt Caching behavior (https://platform.openai.com/docs/guides/prompt-caching), allowing Llama Stack to serve as the prompt caching layer for inference providers that do not support it or only partially support it. (The intent of this is NOT a prefix-aware caching & routing mechanism.)

| Provider | Native Caching | OpenAI Format | Llama Stack Cache | Notes |
|---|---|---|---|---|
| OpenAI | Yes | Yes | Yes | Automatic prefix caching |
| Together | No | Yes | Yes | Manual caching via middleware |
| Anthropic | Yes | Partial | Yes | Uses different cache control API |
| Ollama | No | Yes | Yes | Manual caching via middleware |
| Bedrock | No | Yes | Yes | Manual caching via middleware |
| vLLM | Partial | Yes | Yes | Supports prefix caching in engine |
| Fireworks | No | Yes | Yes | Manual caching via middleware |

Key:

  • Full support - Works as documented
  • Partial support - Limited or different implementation
  • No native support - Llama Stack provides caching layer

williamcaban added a commit to williamcaban/llama-stack that referenced this pull request Nov 16, 2025
This PR implements Phase 1 of the prompt caching feature - automatic
caching of prompt prefixes in OpenAI-compatible chat completion requests.

**Key Features:**
- Automatic caching of prompts ≥1024 tokens (configurable)
- SHA-256 cache key computation (FIPS-compliant)
- Multi-tenant isolation (tenant_id + user_id in cache keys)
- Circuit breaker pattern for graceful degradation
- Streaming request bypass (configurable)
- Token counting integration (PR2)
- Cache store abstraction integration (PR1)
- OpenAI response schema updates (PR3)

**Implementation:**
- src/llama_stack/core/server/prompt_caching.py
- tests/unit/server/test_prompt_caching.py
- 25 comprehensive unit tests (100% passing)
- >95% code coverage

**Dependencies:**
- Requires PR1 (cache-store-abstraction)
- Requires PR2 (tokenization-utilities)
- Requires PR3 (openai-response-schema)

**Test Results:**
- 25/25 unit tests passing
- All pre-commit checks passing (mypy, ruff, ruff-format)

Part of prompt caching implementation - Phase 1 of llamastack#4166

Signed-off-by: William Caban <william.caban@gmail.com>
Co-Authored-By: Claude <noreply@anthropic.com>
@bbrowning
Collaborator

So, it feels like this is conflating two very different things.

Prompt caching is not something Llama Stack generally can or should do, as Llama Stack plays no role in turning requests into tokens.

Being able to reference prompts in Responses requests is something Llama Stack can do, and allowing those prompt references to be pulled from some place like MLflow as shown in your example above could be a useful feature.

@mattf
Collaborator

@mattf left a comment

@williamcaban using MLflow to back prompt management makes sense -

from openai import Client
import requests

# register the prompt template (you can then inspect mlflow to find it)
response = requests.post(
    "http://localhost:8321",
    json={
        "prompt": "You are a helpful AI assistant specialized in {{domain}}. Always provide accurate and concise answers.",
        "variables": ["domain"],
    },
)
prompt_id = response.json()["prompt_id"]

client = Client(base_url="http://localhost:8321", api_key="none")

response = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    prompt={"id": prompt_id, "variables": {"domain": "machine learning"}},
    input="Explain neural networks",
)

If the MLflow prompt provider wants to cache the prompts in the stack server, that could be useful.

However, the prompt caching feature is about caching the prompt's tokens, which the stack cannot utilize. The only API I'm aware of that accepts tokens as input is /v1/embeddings.

@franciscojavierarceo
Collaborator

So, it feels like this is conflating two very different things.

Prompt caching is not something Llama Stack generally can or should do, as Llama Stack plays no role in turning requests into tokens.

Being able to reference prompts in Responses requests is something Llama Stack can do, and allowing those prompt references to be pulled from some place like MLflow as shown in your example above could be a useful feature.

+1, I think a demo would be sufficient to show how MLflow and Llama Stack can play nicely together, particularly for optimizing and evaluating various parameters within the stack and tracking performance on some held-out data of interest. You could then register (i.e., POST to /v1/prompts) the prompt with the best expected performance on that held-out set.

As others have noted, this isn't the traditional prompt caching that most users consider. Since our stack's Prompt API supports multiple KVStore backends, you can already configure Redis as a DB for it in run.yaml, though I have not tried this myself.

@williamcaban
Author

I'm removing this PR for the cache store and the tokenization PR #4168. Would reworking #4170 to bring in only a remote prompt registry be of interest?

@mattf
Collaborator

mattf commented Nov 18, 2025

I'm removing this PR for the cache store and the tokenization PR #4168. Would reworking #4170 to bring in only a remote prompt registry be of interest?

@williamcaban yes pls

@mergify

mergify bot commented Nov 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @williamcaban please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 18, 2025
@williamcaban
Author

Closing in favor of #4170 without caching options.
