-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
Claude/update readme tech stack 011o u4 r4f fe nd4 n aqvhgn9 nb #28738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Claude/update readme tech stack 011o u4 r4f fe nd4 n aqvhgn9 nb #28738
Conversation
Add optimizations for CPU KV cache offloading to dramatically increase context window sizes using system RAM. Achieve 32%+ larger contexts (52k tokens on RTX 4090 24GB). Core Enhancements: - Memory validation now accounts for CPU offload capacity - Smart pinned/unpinned memory allocation prevents GPU OOM - Production-ready web management interface - Complete systemd service integration Changes: - vllm/v1/core/kv_cache_utils.py: Add CPU offload capacity to memory validation - vllm/v1/kv_offload/worker/cpu_gpu.py: Disable pinned memory for large tensors (>10GB) - tools/vllm_manager.py: FastAPI web interface for service management - tools/vllm_cpu_offload_server.py: Production launcher for OpenAI API server - examples/cpu_offload_example.py: Working example with 52k context - docs/cpu_offload.md: Comprehensive documentation and troubleshooting - README.md: Feature highlight and quick start guide Performance: - Max Context: 52,000 tokens (+32% vs baseline 39,344) - Throughput: ~45 tok/s (~15% reduction acceptable for 32% more context) - GPU Memory: 22.4 GB (vs 23.5 GB baseline) - CPU Memory: ~78 GB for KV cache offload Tested on RTX 4090 (24GB) + 128GB RAM with Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…24.5GB RAM Major README update documenting the breakthrough in CPU KV cache offloading: - Added comprehensive section explaining 5x context increase (52k → 256k) - Documented critical multi-layer validation bug fix (87% efficiency gain) - Included memory calculation formulas for determining optimal configurations - Provided configuration examples for different RAM sizes (16GB to 128GB) - Added table showing optimal blocks and memory usage by RAM capacity - Explained why this matters: eliminates GPU memory bottleneck, cost-effective scaling Key achievement: Full 256k context using only 24.5GB RAM (vs 73GB before) Fixed: kv_cache_utils.py now correctly accounts for all attention layers 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
Fixed misleading optimization table that capped all configs at 256k: Before (WRONG): - 64 GB → 256k tokens - 128 GB → 256k tokens After (CORRECT): - 64 GB → 699k tokens - 128 GB → 1.4M tokens Key changes: - Updated table to show maximum theoretical capacity based on RAM - Added important notes explaining model-specific limits - Included formula for calculating context capacity - Clarified that Qwen3-Coder's 256k limit only requires ~22GB RAM - Emphasized that models with 1M+ support can fully utilize 128GB configs The original table was technically correct for Qwen3-Coder-30B specifically (which has max_position_embeddings=262,144), but misleading for users with models that support longer contexts. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Replace personal model paths with generic examples - Add .gitignore rules to prevent personal configs from being committed - Updated tools/vllm_cpu_offload_server.py with placeholder model path - Updated examples/cpu_offload_example.py with placeholder model path Users should now replace "meta-llama/Llama-2-7b-hf" with their own model paths. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
## Problem:
When using CPU KV cache offloading with large context windows, requests
would get preempted due to GPU cache pressure (99.6% usage) but then
get stuck in the waiting queue forever.
## Root Cause:
In OffloadingConnectorScheduler._get_reqs_to_store(), when a request was
marked as preempted/resumed, the code unconditionally cleared all block IDs:
if preempted:
self._request_block_ids[req_id] = []
If no new blocks were allocated (new_block_id_groups is empty), the request
would have NO blocks to run with and couldn't be rescheduled.
## Fix:
- Only clear block IDs if new blocks are available to replace them
- If a resumed request has no new blocks, keep existing block state
- Add debug logging for preemption events
- Add warning when resumed request has no new blocks (helps debugging)
## Impact:
This should prevent the "Running: 0 reqs, Waiting: 1 reqs" deadlock that
occurs when GPU KV cache fills up during large context generation.
Fixes issue where Crush testing with large context windows causes
requests to hang indefinitely after preemption.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Document the latest bug fix that prevents requests from getting stuck in the waiting queue after preemption when using CPU offload with large context windows. Includes: - Problem description with symptoms - Root cause analysis - Solution details with file/line references - Impact statement This helps users understand the fix when they encounter the "Running: 0 reqs, Waiting: 1 reqs" issue during Crush testing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Critical bug fix for CPU KV cache offloading feature that was only counting memory for a single layer instead of all model layers. **Problem:** - cpu_offload_bytes calculated as: num_cpu_blocks * page_size_bytes - This only accounted for ONE layer's worth of blocks - For a 48-layer model, was reporting 0.51 GiB instead of 24.48 GiB - Resulted in "insufficient memory" errors despite adequate CPU RAM **Solution:** - Multiply by num_layers: num_cpu_blocks * num_layers * page_size_bytes - Now correctly accounts for all attention layers in the model - Properly validates available CPU offload capacity **Impact:** - Enables full 256K context window operation - Fixes memory validation for multi-layer models - Critical for production deployment of CPU offload feature Updated components: - vllm/v1/core/kv_cache_utils.py: Fixed memory calculation logic - README.md: Updated documentation to reflect accurate behavior - tools/vllm_manager.py: Enhanced management capabilities - requirements/cuda.txt: Updated dependencies 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
**Problem:** - Web interface was polling status API every 5 seconds - Created excessive log entries filling up disk space - GPU info was not displaying (missing nvidia-ml-py dependency) **Solution:** - Reduced polling interval from 5 seconds to 30 seconds (6x reduction) - Added nvidia-ml-py package for GPU monitoring support - GPU memory, utilization, temperature, and power now displayed correctly **Changes:** - tools/vllm_manager.py: Updated setInterval from 5000ms to 30000ms (line 1240) **Impact:** - Significantly reduced log spam (83% fewer API calls) - GPU monitoring now fully functional - Maintains responsive UI while protecting system resources **Testing:** - Verified 30-second polling rate in logs - Confirmed GPU info displays: memory (22.23/23.99 GB), utilization (0%), temp (41°C) - Manager web interface accessible at http://localhost:7999 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Adds compatibility with Anthropic's API by implementing the count_tokens endpoint
that allows clients to count input tokens before sending a request.
**Implementation:**
- Added AnthropicCountTokensRequest and AnthropicCountTokensResponse to protocol
- Implemented count_tokens method in AnthropicServingMessages
- Added endpoint to both Anthropic API server and OpenAI API server for compatibility
- Converts Anthropic message format to OpenAI format internally
- Uses existing tokenizer infrastructure to count tokens
**API Endpoint:**
POST /v1/messages/count_tokens
**Request:**
{
"model": "model-name",
"messages": [{"role": "user", "content": "Hello"}],
"system": "optional system prompt"
}
**Response:**
{
"input_tokens": 17
}
**Benefits:**
- Improves API compatibility with Anthropic Claude clients
- Allows cost estimation before making completion requests
- Helps manage context window limits
- Simple, lightweight endpoint (no generation required)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…ndows Add automated installation scripts that handle complete vLLM setup with CPU offloading support across multiple platforms. **Features:** - Interactive configuration with sensible defaults - Automatic dependency installation and verification - Model download support (HuggingFace, custom URLs, or local) - Flexible installation directory (home, /opt, or custom) - Systemd service creation (Linux/macOS) - Windows Service creation via NSSM - Optional vLLM Manager web interface installation - CPU offload fix automatically applied - Comprehensive error handling and user feedback **Files Added:** - install.sh: Linux/macOS Bash installer with systemd integration - install.ps1: Windows PowerShell installer with service management - INSTALL_README.md: Complete installation and usage documentation **Installation Options:** - Custom installation directory - Model path configuration (existing/download) - Server port and parameters - GPU memory utilization - CPU offload block configuration - Tool parser selection - Optional manager web interface **Service Management:** - Auto-start on boot - Log rotation and management - Easy start/stop/restart commands - Status monitoring - Automatic service recovery **Supported Systems:** - Ubuntu/Debian Linux - RHEL/CentOS/Fedora - macOS (via Homebrew prerequisites) - Windows 10/11 with PowerShell Makes vLLM deployment accessible to users without manual configuration. One-command installation from clone to running service. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Merged 259 commits from upstream vLLM repository. Key upstream changes: - V1 engine improvements and optimizations - Enhanced MoE kernel support - New model support (Keye, PaddleOCR, OpenPangu) - Sagemaker integration improvements - LoRA dynamic loading moved to separate module - Anthropic API consolidated into main API server - Bug fixes and performance improvements Local changes preserved: - CPU KV cache offload optimizations - Personal configuration exclusions in .gitignore - Custom installation scripts - README documentation of breakthrough features Conflicts resolved: - .gitignore: Kept both personal exclusions and upstream changes - README.md: Kept both local and upstream news entries - api_server.py: Accepted upstream refactoring - anthropic/api_server.py: Accepted deletion (functionality moved) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Critical fixes: - Fix scheduler deadlock after KV cache eviction: Remove 'if not preempted_reqs' condition that prevented waiting requests from being scheduled after preemption. This caused 0 running requests with N waiting requests indefinitely after cache fills and clears. The fix always attempts to schedule waiting requests even after preemption, since preemption frees up KV cache blocks that can be reused. vLLM Manager improvements: - Add live log monitoring with WebSocket streaming from /var/log/vllm.log - Add real-time KV cache usage, request queue, and throughput monitoring - Add collapsible log lines (click to expand/collapse) - Add KV cache clear button to manually reset prefix cache - Add deadlock detection (alerts if 0 running + waiting for >30s) - Make start/stop buttons work without model selection - Expand container width to 1200px for better visibility - Color-code logs by severity (ERROR/WARNING/INFO) - Auto-scroll with toggle for log viewer These changes fix the production issue where vLLM stopped processing after 1000+ requests due to scheduler deadlock, and provide comprehensive monitoring tools for tracking system health during high-volume testing. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Additions: - Add comprehensive Technology Stack section highlighting the modern stack used in this fork (FastAPI, WebSocket, systemd, PyTorch, custom KV offloading, etc.) - Add vLLM Manager section documenting the production-ready service management tool with features like auto-restart, crash loop detection, live monitoring, and web UI - Update Latest News to feature both major innovations (vLLM Manager and CPU KV Cache Offload) The README now clearly communicates the value proposition of this fork: 1. CPU KV Cache Offloading: Massive context windows (256K tokens on 24GB RAM) 2. vLLM Manager: Production-ready service management with self-healing capabilities 3. Modern tech stack: Zero-dependency deployments with native Linux integration These updates make it easier for users to understand the innovations and benefits of this fork compared to upstream vLLM.
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
|
Documentation preview: https://vllm--28738.org.readthedocs.build/en/28738/ |
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces significant new functionality for CPU KV cache offloading, including installation scripts for multiple platforms and a web-based management UI. The core logic changes for offloading and scheduler deadlocks appear to be solid bug fixes. However, the implementation of the installation scripts and the management UI introduces several critical and high-severity issues. Specifically, the Windows installer insecurely downloads an executable, the Linux installer uses a brittle patching method, and the management UI relies on fragile log parsing for metrics and has unhandled sudo requirements. These issues should be addressed to ensure the security, correctness, and maintainability of the new features.
| if (-not (Test-Path $nssmPath)) { | ||
| Write-Info "Downloading NSSM (Non-Sucking Service Manager)..." | ||
| $nssmUrl = "https://nssm.cc/release/nssm-2.24.zip" | ||
| $nssmZip = "$env:TEMP\nssm.zip" | ||
| Invoke-WebRequest -Uri $nssmUrl -OutFile $nssmZip | ||
| Expand-Archive -Path $nssmZip -DestinationPath $env:TEMP -Force | ||
| Copy-Item "$env:TEMP\nssm-2.24\win64\nssm.exe" $nssmPath | ||
| Remove-Item $nssmZip | ||
| Remove-Item "$env:TEMP\nssm-2.24" -Recurse | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script downloads nssm.exe from an external URL and executes it without verifying its integrity. This creates a significant security vulnerability. If the nssm.cc domain or the zip file is compromised, an attacker could execute arbitrary code on the user's machine. It is critical to verify the checksum (e.g., SHA256) of the downloaded file against a known-good value.
$nssmUrl = "https://nssm.cc/release/nssm-2.24.zip"
$expectedHash = "c5d85c53373a09b4a03c456be83153355f547d33923346596150533566106612" # SHA256 for nssm-2.24.zip
$nssmZip = "$env:TEMP\nssm.zip"
Invoke-WebRequest -Uri $nssmUrl -OutFile $nssmZip
$actualHash = (Get-FileHash -Path $nssmZip -Algorithm SHA256).Hash.ToLower()
if ($actualHash -ne $expectedHash) {
Write-Error-Message "NSSM download hash mismatch! Expected: $expectedHash, Got: $actualHash. Aborting installation."
Remove-Item $nssmZip
exit 1
}
Write-Success "NSSM download verified."
Expand-Archive -Path $nssmZip -DestinationPath $env:TEMP -Force
Copy-Item "$env:TEMP\nssm-2.24\win64\nssm.exe" $nssmPath
Remove-Item $nssmZip
Remove-Item "$env:TEMP\nssm-2.24" -Recurse
| @app.get("/monitoring/stats") | ||
| async def get_monitoring_stats(): | ||
| """Get real-time monitoring stats from vLLM logs""" | ||
| try: | ||
| # Parse the latest stats from vllm log | ||
| log_file = "/var/log/vllm.log" | ||
| stats = { | ||
| "kv_cache_pct": None, | ||
| "running": None, | ||
| "waiting": None, | ||
| "prompt_throughput": None, | ||
| "gen_throughput": None, | ||
| "timestamp": datetime.now().isoformat() | ||
| } | ||
|
|
||
| if not os.path.exists(log_file): | ||
| return stats | ||
|
|
||
| # Read last 100 lines of log file | ||
| with open(log_file, 'r') as f: | ||
| lines = f.readlines()[-100:] | ||
|
|
||
| import re | ||
| for line in reversed(lines): | ||
| # GPU KV cache usage: 45.2% | ||
| if "GPU KV cache usage:" in line and stats["kv_cache_pct"] is None: | ||
| match = re.search(r'GPU KV cache usage: ([\d.]+)%', line) | ||
| if match: | ||
| stats["kv_cache_pct"] = float(match.group(1)) | ||
|
|
||
| # Running: 3, Waiting: 2 | ||
| if "Running:" in line and "Waiting:" in line and stats["running"] is None: | ||
| running_match = re.search(r'Running: (\d+)', line) | ||
| waiting_match = re.search(r'Waiting: (\d+)', line) | ||
| if running_match and waiting_match: | ||
| stats["running"] = int(running_match.group(1)) | ||
| stats["waiting"] = int(waiting_match.group(1)) | ||
|
|
||
| # Avg prompt throughput: 123.4 tokens/s | ||
| if "Avg prompt throughput:" in line and stats["prompt_throughput"] is None: | ||
| match = re.search(r'Avg prompt throughput: ([\d.]+) tokens/s', line) | ||
| if match: | ||
| stats["prompt_throughput"] = float(match.group(1)) | ||
|
|
||
| # Avg generation throughput: 456.7 tokens/s | ||
| if "Avg generation throughput:" in line and stats["gen_throughput"] is None: | ||
| match = re.search(r'Avg generation throughput: ([\d.]+) tokens/s', line) | ||
| if match: | ||
| stats["gen_throughput"] = float(match.group(1)) | ||
|
|
||
| return stats | ||
| except Exception as e: | ||
| return { | ||
| "error": str(e), | ||
| "timestamp": datetime.now().isoformat() | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The manager relies on parsing the log file at the hardcoded path /var/log/vllm.log to extract real-time monitoring statistics. This is a critical anti-pattern for several reasons:
- Brittleness: Parsing metrics from human-readable log lines with regular expressions is extremely fragile. Any change in the log format, even minor ones, will break the metrics collection.
- Tight Coupling: The manager is tightly coupled to the
systemdservice configuration, assuming the log file will always be at/var/log/vllm.log. This makes the setup inflexible. - Inefficiency: Reading and parsing a log file repeatedly is inefficient for metrics collection.
A much more robust and standard approach is to expose metrics via a dedicated endpoint, for example, using a library like Prometheus. The engine should expose its internal state through a structured API, not through unstructured logs.
| # Apply CPU offload fix | ||
| apply_cpu_offload_fix() { | ||
| print_header "Applying CPU Offload Memory Fix" | ||
|
|
||
| local kv_cache_utils="$INSTALL_DIR/venv/lib/python*/site-packages/vllm/v1/core/kv_cache_utils.py" | ||
|
|
||
| if [[ -f "$SCRIPT_DIR/vllm/v1/core/kv_cache_utils.py" ]]; then | ||
| cp "$SCRIPT_DIR/vllm/v1/core/kv_cache_utils.py" $kv_cache_utils | ||
| print_success "CPU offload fix applied" | ||
| else | ||
| print_warning "kv_cache_utils.py fix not found in source" | ||
| print_warning "You may need to apply the fix manually" | ||
| fi | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The apply_cpu_offload_fix function patches the installed vllm library by directly copying a file into the site-packages directory. This approach is very brittle and not portable. It relies on the internal directory structure of Python's virtual environments, which can vary between Python versions, operating systems, and pip versions. A more robust solution would be to install the forked version of vllm directly from the source code, which would include all necessary patches.
| # Monkey-patch the KVTransferConfig before starting the server | ||
| import vllm.engine.arg_utils | ||
| original_create_engine_config = vllm.engine.arg_utils.EngineArgs.create_engine_config | ||
|
|
||
| def patched_create_engine_config(self, *args, **kwargs): | ||
| # Inject our kv_transfer_config | ||
| if self.kv_transfer_config is None: | ||
| self.kv_transfer_config = kv_transfer_config | ||
| return original_create_engine_config(self, *args, **kwargs) | ||
|
|
||
| vllm.engine.arg_utils.EngineArgs.create_engine_config = patched_create_engine_config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script uses monkey-patching to inject the KVTransferConfig into the vLLM engine. This is a fragile approach that depends on the internal implementation details of vllm.engine.arg_utils.EngineArgs.create_engine_config. Any changes to this function in future vLLM versions could break this script silently. It would be more robust to pass the configuration through a supported API, if one exists, or to modify the entrypoint to accept it directly. If this is the only way, it should be heavily commented with warnings about its brittleness.
| def run_systemctl_command(action: str) -> tuple[bool, str]: | ||
| """Run systemctl command and return success status and output""" | ||
| try: | ||
| result = subprocess.run( | ||
| ['sudo', 'systemctl', action, SERVICE_NAME], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=30 | ||
| ) | ||
| return result.returncode == 0, result.stdout + result.stderr | ||
| except subprocess.TimeoutExpired: | ||
| return False, "Command timed out" | ||
| except Exception as e: | ||
| return False, str(e) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The run_systemctl_command function executes sudo systemctl ... to manage the vllm.service. This assumes that the user running the vllm_manager.py script has passwordless sudo privileges for systemctl. If not, the command will hang waiting for a password, or fail. This is a significant usability and security issue. The requirement for passwordless sudo should be clearly documented in INSTALL_README.md and README.md, along with instructions on how to configure it securely (e.g., via sudoers file to only allow specific commands).
| self.waiting.pop_request() | ||
| skipped_waiting_requests.prepend_request(request) | ||
| continue | ||
| # NOTE: We should always attempt to schedule waiting requests even if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original condition if not preempted_reqs: could cause a deadlock where, after preemption, no waiting requests would be scheduled, leaving the system with 0 running requests and N waiting requests indefinitely. Removing this condition is a critical fix to ensure that the scheduler always attempts to schedule waiting requests if there are free resources, which is correct behavior after preemption frees up KV cache blocks.
| # PATCH: Disable pinned memory for large allocations (>10GB) | ||
| # to avoid GPU OOM when using massive RAM-based KV cache | ||
| tensor_size_bytes = 1 | ||
| for dim in cpu_shape: | ||
| tensor_size_bytes *= dim | ||
| tensor_size_bytes *= gpu_tensor.element_size() | ||
| tensor_size_gb = tensor_size_bytes / (1024**3) | ||
|
|
||
| use_pin_memory = pin_memory and tensor_size_gb < 10.0 | ||
| if not use_pin_memory and pin_memory: | ||
| logger.info( | ||
| "Disabling pinned memory for large CPU tensor (%.2f GB) " | ||
| "to avoid GPU memory pressure", tensor_size_gb) | ||
|
|
||
| logger.debug("Allocating CPU tensor of shape %r", cpu_shape) | ||
| self.cpu_tensors.append( | ||
| torch.zeros( | ||
| cpu_shape, | ||
| dtype=gpu_tensor.dtype, | ||
| device="cpu", | ||
| pin_memory=pin_memory, | ||
| ) | ||
| ) | ||
| torch.zeros(cpu_shape, | ||
| dtype=gpu_tensor.dtype, | ||
| device="cpu", | ||
| pin_memory=use_pin_memory)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disabling pinned memory for large CPU tensor allocations (>10GB) is a crucial fix to prevent potential GPU out-of-memory errors. Allocating large amounts of pinned memory can exhaust GPU page-table space, leading to CUDA errors. This change makes the CPU offloading feature more robust, especially when dealing with very large context windows that require significant CPU RAM for the KV cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """ | ||
| mm_extra_keys: list[Any] | ||
| mm_extra_keys, new_start_mm_idx = _gen_mm_extra_hash_keys( | ||
| request, start_token_idx, end_token_idx, start_mm_idx | ||
| ) | ||
| lora_extra_keys: list[str] = _gen_lora_extra_hash_keys(request) | ||
| cache_salt_keys: list[str] = ( | ||
| [request.cache_salt] if (start_token_idx == 0 and request.cache_salt) else [] | ||
| ) | ||
| prompt_embeds_keys = _gen_prompt_embeds_extra_hash_keys( | ||
| request, start_token_idx, end_token_idx | ||
| ) | ||
| request, start_token_idx, end_token_idx, start_mm_idx) | ||
| lora_extra_keys: list[int] = _gen_lora_extra_hash_keys(request) | ||
| cache_salt_keys: list[str] = [request.cache_salt] if ( | ||
| start_token_idx == 0 and request.cache_salt) else [] | ||
|
|
||
| extra_keys: list[Any] = ( | ||
| lora_extra_keys + mm_extra_keys + cache_salt_keys + prompt_embeds_keys | ||
| ) | ||
| extra_keys: list[Any] = lora_extra_keys + mm_extra_keys + cache_salt_keys | ||
|
|
||
| if not extra_keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix cache ignores prompt_embeds
In generate_block_hash_extra_keys you removed _gen_prompt_embeds_extra_hash_keys and now only include LoRA/MM/salt keys (lines 502‑541). Requests that supply prompt_embeds no longer contribute any per‑request data to the block hash; Request._all_token_ids is just zero placeholders when embeddings are used, so two different embedding payloads of the same length now hash identically. This causes the prefix cache to treat unrelated prompts as identical, reusing another user’s KV blocks and returning incorrect completions, and it breaks the existing tests that verify prompt‑embed hashing. Please reintroduce the prompt‑embed bytes (or another unique key) into the block hash for these requests so cached blocks remain correct.
Useful? React with 👍 / 👎.
| async def count_tokens( | ||
| self, | ||
| request: AnthropicCountTokensRequest, | ||
| raw_request: Request, | ||
| ) -> AnthropicCountTokensResponse | ErrorResponse: | ||
| """Count tokens in messages without generating a completion""" | ||
| # Convert Anthropic format to OpenAI format for processing | ||
| anthropic_messages_request = AnthropicMessagesRequest( | ||
| model=request.model, | ||
| messages=request.messages, | ||
| max_tokens=1, # Required but not used for counting | ||
| system=request.system, | ||
| tools=request.tools, | ||
| ) | ||
|
|
||
| openai_request = self._convert_anthropic_to_openai_request( | ||
| anthropic_messages_request | ||
| ) | ||
|
|
||
| # Check model validity | ||
| error_check_ret = await self._check_model(openai_request) | ||
| if error_check_ret is not None: | ||
| return error_check_ret | ||
|
|
||
| # Get tokenizer and count tokens | ||
| try: | ||
| tokenizer = await self.engine_client.get_tokenizer() | ||
|
|
||
| # Process the messages to get token count | ||
| conversation, _ = await self._get_prompt_messages( | ||
| openai_request, tokenizer, [] | ||
| ) | ||
|
|
||
| prompt_text = self._apply_chat_template( | ||
| tokenizer, | ||
| conversation, | ||
| chat_template=self.chat_template, | ||
| ) | ||
|
|
||
| # Tokenize to get count | ||
| token_ids = tokenizer.encode(prompt_text) | ||
| input_tokens = len(token_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count_tokens calls undefined helper
The new AnthropicServingMessages.count_tokens method (lines 464‑505) invokes await self._get_prompt_messages(...), but neither this class nor its OpenAI serving base defines _get_prompt_messages anywhere in the repository (a global search returns only this call). The first request to the count‑tokens endpoint will therefore raise AttributeError: 'AnthropicServingMessages' object has no attribute '_get_prompt_messages', making the API unusable. Please reuse the existing _get_prompt_components logic or add the missing helper before exposing this endpoint.
Useful? React with 👍 / 👎.
|
Close as it seems to be ai-generated. |
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.