@neubig neubig commented Nov 7, 2025

Summary

This PR implements a FallbackRouter that provides automatic failover across multiple LLMs when the primary model fails. This addresses the need for robust LLM usage with automatic fallback capabilities.

Related PRs

Key Features

  • Automatic Failover: Automatically falls back to secondary models when the primary model encounters errors (rate limits, connection failures, service unavailability, etc.)
  • Multiple Fallback Levels: Supports chaining multiple fallback models for robust failover
  • Telemetry Preservation: Preserves telemetry and metrics from the active model
  • Comprehensive Logging: Includes detailed logging of all failover attempts for debugging

Implementation Details

New Components

  1. FallbackRouter Class (openhands/sdk/llm/router/impl/fallback.py)

    • Extends RouterLLM base class
    • Overrides completion() method to implement fallback logic
    • Validates that 'primary' key exists in llms_for_routing
    • Tracks active_llm for proper telemetry delegation
  2. Comprehensive Tests (tests/sdk/llm/test_fallback_router.py)

    • 8 unit tests covering all scenarios:
      • Router creation and validation
      • Successful primary model completion
      • Fallback on rate limit errors
      • Fallback on connection errors
      • Error propagation when all models fail
      • Multiple fallback levels
      • select_llm method behavior
  3. Usage Example (examples/01_standalone_sdk/27_llm_fallback.py)

    • Demonstrates how to configure primary and fallback models
    • Shows proper logging setup to observe failover behavior
    • Includes comments explaining the configuration
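The fallback logic described above can be sketched as follows. This is a minimal, illustrative simplification (class and attribute names are assumptions), not the actual SDK source: the real FallbackRouter extends RouterLLM and inherits its telemetry delegation.

```python
# Illustrative sketch of the FallbackRouter behavior described above.
# Not the SDK's actual implementation.
class FallbackRouterSketch:
    def __init__(self, llms_for_routing):
        # Validate that the required 'primary' key exists.
        if "primary" not in llms_for_routing:
            raise ValueError("llms_for_routing must contain a 'primary' key")
        # Try primary first, then the remaining models in insertion order.
        self._order = ["primary"] + [k for k in llms_for_routing if k != "primary"]
        self._llms = llms_for_routing
        self.active_llm = None

    def completion(self, **kwargs):
        last_exc = None
        for name in self._order:
            try:
                result = self._llms[name].completion(**kwargs)
                self.active_llm = self._llms[name]  # telemetry follows this model
                return result
            except Exception as exc:  # rate limit, connection error, 5xx, ...
                last_exc = exc
        raise last_exc  # every model failed; surface the last error
```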

How It Works

from openhands.sdk.llm import LLM
from openhands.sdk.llm.router import FallbackRouter
from pydantic import SecretStr

# Create primary and fallback LLMs
primary = LLM(model="gpt-4", api_key=SecretStr("key"), usage_id="primary")
fallback = LLM(model="gpt-3.5-turbo", api_key=SecretStr("key"), usage_id="fallback")

# Create FallbackRouter
router = FallbackRouter(
    usage_id="my-router",
    llms_for_routing={"primary": primary, "fallback": fallback}
)

# Use router like a regular LLM - it will automatically fall back if primary fails
response = router.completion(messages=[...])

Design Decisions

  1. RouterLLM Pattern: Implemented as a router (similar to MultimodalRouter and RandomRouter) rather than modifying the base LLM class, keeping changes modular and non-invasive.

  2. Primary Key Requirement: Requires "primary" key in llms_for_routing dictionary for clear intent and consistent behavior.

  3. Error Handling: Catches all exceptions during completion attempts and tries the next model, only raising the last exception if all models fail.

  4. Telemetry: Delegates telemetry to the active LLM through the base RouterLLM class's __getattr__ mechanism.
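The `__getattr__` delegation mentioned in point 4 can be sketched like this (a hedged simplification; the real RouterLLM may differ in detail):

```python
# Sketch of attribute delegation to the active LLM. Class name and
# structure are assumptions for illustration only.
class RouterBaseSketch:
    def __init__(self, active_llm):
        self.active_llm = active_llm

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal attribute lookup fails,
        # so router-level attributes are untouched; everything else
        # (metrics, telemetry) is forwarded to the model that actually
        # handled the request.
        return getattr(self.active_llm, name)
```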

Testing

All tests pass successfully:

$ uv run pytest tests/sdk/llm/test_fallback_router.py -v
========== 8 passed in 0.05s ==========

Tests include:

  • ✅ Router creation with proper validation
  • ✅ Successful completion with primary model
  • ✅ Fallback on rate limit errors
  • ✅ Fallback on connection errors
  • ✅ Error propagation when all models fail
  • ✅ Multiple fallback model chaining
  • ✅ Proper select_llm behavior
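A test in the spirit of the suite above might use mocks like this (names and structure are illustrative, not the SDK's actual test code):

```python
# Illustrative mocked fallback test; `run_with_fallback` is a stand-in
# for the router's try-each-model loop, not an SDK function.
from unittest.mock import MagicMock

def run_with_fallback(llms, **kwargs):
    last_exc = None
    for llm in llms:
        try:
            return llm.completion(**kwargs)
        except Exception as exc:
            last_exc = exc
    raise last_exc

def test_fallback_on_rate_limit():
    primary = MagicMock()
    primary.completion.side_effect = RuntimeError("429 rate limited")
    fallback = MagicMock()
    fallback.completion.return_value = "response-from-fallback"

    result = run_with_fallback([primary, fallback], messages=[])

    assert result == "response-from-fallback"
    primary.completion.assert_called_once()
    fallback.completion.assert_called_once()
```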

Related Issues

This implements fallback language model functionality as requested; the SDK did not previously support it.

Checklist

  • Implementation follows existing patterns (RouterLLM)
  • Comprehensive unit tests added
  • Example demonstrating usage included
  • Code passes all pre-commit hooks (ruff, pyright, pycodestyle)
  • Well-documented with docstrings
  • No breaking changes to existing functionality
  • Documentation PR created (Document FallbackRouter for automatic LLM failover docs#93)

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| ------- | ------------- | ---------- | ----------- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ce8e169-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-ce8e169-python \
  ghcr.io/openhands/agent-server:ce8e169-python

All tags pushed for this build

ghcr.io/openhands/agent-server:ce8e169-golang-amd64
ghcr.io/openhands/agent-server:ce8e169-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:ce8e169-golang-arm64
ghcr.io/openhands/agent-server:ce8e169-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:ce8e169-java-amd64
ghcr.io/openhands/agent-server:ce8e169-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:ce8e169-java-arm64
ghcr.io/openhands/agent-server:ce8e169-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:ce8e169-python-amd64
ghcr.io/openhands/agent-server:ce8e169-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:ce8e169-python-arm64
ghcr.io/openhands/agent-server:ce8e169-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:ce8e169-golang
ghcr.io/openhands/agent-server:ce8e169-java
ghcr.io/openhands/agent-server:ce8e169-python

About Multi-Architecture Support

  • Each variant tag (e.g., ce8e169-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., ce8e169-python-amd64) are also available if needed

This commit implements a FallbackRouter that provides automatic failover
across multiple LLMs when the primary model fails. Key features:

- Automatically falls back to secondary models on errors (rate limits,
  connection failures, service unavailable, etc.)
- Supports multiple fallback models in a chain
- Preserves telemetry and metrics from the active model
- Includes comprehensive logging of failover attempts

Implementation:
- New FallbackRouter class extending RouterLLM
- Overrides completion() to implement fallback logic
- Validates that 'primary' key exists in llms_for_routing
- Tracks active_llm for telemetry purposes

Tests:
- 8 comprehensive unit tests covering all scenarios
- Mocked LLM responses to avoid actual API calls
- Tests for successful completion, fallback scenarios, and error cases

Example:
- examples/01_standalone_sdk/27_llm_fallback.py demonstrates usage
- Shows how to configure primary and fallback models
- Includes logging setup to observe failover behavior

Co-authored-by: openhands <openhands@all-hands.dev>
github-actions bot commented Nov 7, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| ---- | ----- | ---- | ----- | ------- |
| openhands-sdk/openhands/sdk/llm/router/impl/fallback.py | 46 | 26 | 43% | 43–46, 52–54, 61, 75, 77–78, 80–81, 85, 87, 95, 99, 101–103, 108–109, 112, 114, 120–121 |
| TOTAL | 11831 | 5471 | 53% | |

- Change llms parameter from dictionary to list for simpler API
- Models are tried in list order (similar to litellm's approach)
- Internally converts list to dict for base class compatibility
- Update validator to check for empty list instead of missing 'primary' key
- Update logging to show model index (1/N, 2/N, etc.)
- Update example and tests to use new list-based API
- Update documentation to reflect list-based approach

This makes the API more intuitive and consistent with litellm's pattern.
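The internal list-to-dict conversion described in this commit could look roughly like the sketch below (the key format is an assumption for illustration, not the actual internal naming):

```python
# Hypothetical sketch of converting the list-based API to the dict the
# base class expects. Key names are illustrative assumptions.
def llms_list_to_dict(llms):
    if not llms:
        raise ValueError("llms must contain at least one model")
    # dicts preserve insertion order, so models are tried as 1/N, 2/N, ...
    return {f"model_{i}": llm for i, llm in enumerate(llms)}
```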

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Nov 7, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Check Documented Examples

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1103 at branch `feature/llm-fallback-router`

Feel free to include any additional details that might help me get this PR into a better state.

