@neubig neubig commented Nov 7, 2025

Summary

This PR implements a FallbackRouter that provides automatic failover across multiple LLMs when the primary model fails. This addresses the need for robust LLM usage with automatic fallback capabilities.

Related PRs

Key Features

  • Automatic Failover: Automatically falls back to secondary models when the primary model encounters errors (rate limits, connection failures, service unavailability, etc.)
  • Multiple Fallback Levels: Supports chaining multiple fallback models for robust failover
  • Telemetry Preservation: Preserves telemetry and metrics from the active model
  • Comprehensive Logging: Includes detailed logging of all failover attempts for debugging

Implementation Details

New Components

  1. FallbackRouter Class (openhands/sdk/llm/router/impl/fallback.py)

    • Extends RouterLLM base class
    • Overrides completion() method to implement fallback logic
    • Validates that 'primary' key exists in llms_for_routing
    • Tracks active_llm for proper telemetry delegation
  2. Comprehensive Tests (tests/sdk/llm/test_fallback_router.py)

    • 8 unit tests covering all scenarios:
      • Router creation and validation
      • Successful primary model completion
      • Fallback on rate limit errors
      • Fallback on connection errors
      • Error propagation when all models fail
      • Multiple fallback levels
      • select_llm method behavior
  3. Usage Example (examples/01_standalone_sdk/27_llm_fallback.py)

    • Demonstrates how to configure primary and fallback models
    • Shows proper logging setup to observe failover behavior
    • Includes comments explaining the configuration
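The fallback logic described above can be sketched as follows. This is a minimal, illustrative simplification (class and attribute names are assumptions), not the actual SDK source: the real FallbackRouter extends RouterLLM and inherits its telemetry delegation.

```python
# Illustrative sketch of the FallbackRouter behavior described above.
# Not the SDK's actual implementation.
class FallbackRouterSketch:
    def __init__(self, llms_for_routing):
        # Validate that the required 'primary' key exists.
        if "primary" not in llms_for_routing:
            raise ValueError("llms_for_routing must contain a 'primary' key")
        # Try primary first, then the remaining models in insertion order.
        self._order = ["primary"] + [k for k in llms_for_routing if k != "primary"]
        self._llms = llms_for_routing
        self.active_llm = None

    def completion(self, **kwargs):
        last_exc = None
        for name in self._order:
            try:
                result = self._llms[name].completion(**kwargs)
                self.active_llm = self._llms[name]  # telemetry follows this model
                return result
            except Exception as exc:  # rate limit, connection error, 5xx, ...
                last_exc = exc
        raise last_exc  # every model failed; surface the last error
```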

How It Works

from openhands.sdk.llm import LLM
from openhands.sdk.llm.router import FallbackRouter
from pydantic import SecretStr

# Create primary and fallback LLMs
primary = LLM(model="gpt-4", api_key=SecretStr("key"), usage_id="primary")
fallback = LLM(model="gpt-3.5-turbo", api_key=SecretStr("key"), usage_id="fallback")

# Create FallbackRouter
router = FallbackRouter(
    usage_id="my-router",
    llms_for_routing={"primary": primary, "fallback": fallback}
)

# Use router like a regular LLM - it will automatically fall back if primary fails
response = router.completion(messages=[...])

Design Decisions

  1. RouterLLM Pattern: Implemented as a router (similar to MultimodalRouter and RandomRouter) rather than modifying the base LLM class, keeping changes modular and non-invasive.

  2. Primary Key Requirement: Requires "primary" key in llms_for_routing dictionary for clear intent and consistent behavior.

  3. Error Handling: Catches all exceptions during completion attempts and tries the next model, only raising the last exception if all models fail.

  4. Telemetry: Delegates telemetry to the active LLM through the base RouterLLM class's __getattr__ mechanism.
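The `__getattr__` delegation mentioned in point 4 can be sketched like this (a hedged simplification; the real RouterLLM may differ in detail):

```python
# Sketch of attribute delegation to the active LLM. Class name and
# structure are assumptions for illustration only.
class RouterBaseSketch:
    def __init__(self, active_llm):
        self.active_llm = active_llm

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal attribute lookup fails,
        # so router-level attributes are untouched; everything else
        # (metrics, telemetry) is forwarded to the model that actually
        # handled the request.
        return getattr(self.active_llm, name)
```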

Testing

All tests pass successfully:

$ uv run pytest tests/sdk/llm/test_fallback_router.py -v
========== 8 passed in 0.05s ==========

Tests include:

  • ✅ Router creation with proper validation
  • ✅ Successful completion with primary model
  • ✅ Fallback on rate limit errors
  • ✅ Fallback on connection errors
  • ✅ Error propagation when all models fail
  • ✅ Multiple fallback model chaining
  • ✅ Proper select_llm behavior
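A test in the spirit of the suite above might use mocks like this (names and structure are illustrative, not the SDK's actual test code):

```python
# Illustrative mocked fallback test; `run_with_fallback` is a stand-in
# for the router's try-each-model loop, not an SDK function.
from unittest.mock import MagicMock

def run_with_fallback(llms, **kwargs):
    last_exc = None
    for llm in llms:
        try:
            return llm.completion(**kwargs)
        except Exception as exc:
            last_exc = exc
    raise last_exc

def test_fallback_on_rate_limit():
    primary = MagicMock()
    primary.completion.side_effect = RuntimeError("429 rate limited")
    fallback = MagicMock()
    fallback.completion.return_value = "response-from-fallback"

    result = run_with_fallback([primary, fallback], messages=[])

    assert result == "response-from-fallback"
    primary.completion.assert_called_once()
    fallback.completion.assert_called_once()
```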

Related Issues

This implements fallback language model functionality as requested; the SDK did not previously support it.

Checklist

  • Implementation follows existing patterns (RouterLLM)
  • Comprehensive unit tests added
  • Example demonstrating usage included
  • Code passes all pre-commit hooks (ruff, pyright, pycodestyle)
  • Well-documented with docstrings
  • No breaking changes to existing functionality
  • Documentation PR created (Document FallbackRouter for automatic LLM failover docs#93)

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| ------- | ------------- | ---------- | ----------- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ce8e169-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-ce8e169-python \
  ghcr.io/openhands/agent-server:ce8e169-python

All tags pushed for this build

ghcr.io/openhands/agent-server:ce8e169-golang-amd64
ghcr.io/openhands/agent-server:ce8e169-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:ce8e169-golang-arm64
ghcr.io/openhands/agent-server:ce8e169-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:ce8e169-java-amd64
ghcr.io/openhands/agent-server:ce8e169-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:ce8e169-java-arm64
ghcr.io/openhands/agent-server:ce8e169-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:ce8e169-python-amd64
ghcr.io/openhands/agent-server:ce8e169-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:ce8e169-python-arm64
ghcr.io/openhands/agent-server:ce8e169-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:ce8e169-golang
ghcr.io/openhands/agent-server:ce8e169-java
ghcr.io/openhands/agent-server:ce8e169-python

About Multi-Architecture Support

  • Each variant tag (e.g., ce8e169-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., ce8e169-python-amd64) are also available if needed

This commit implements a FallbackRouter that provides automatic failover
across multiple LLMs when the primary model fails. Key features:

- Automatically falls back to secondary models on errors (rate limits,
  connection failures, service unavailable, etc.)
- Supports multiple fallback models in a chain
- Preserves telemetry and metrics from the active model
- Includes comprehensive logging of failover attempts

Implementation:
- New FallbackRouter class extending RouterLLM
- Overrides completion() to implement fallback logic
- Validates that 'primary' key exists in llms_for_routing
- Tracks active_llm for telemetry purposes

Tests:
- 8 comprehensive unit tests covering all scenarios
- Mocked LLM responses to avoid actual API calls
- Tests for successful completion, fallback scenarios, and error cases

Example:
- examples/01_standalone_sdk/27_llm_fallback.py demonstrates usage
- Shows how to configure primary and fallback models
- Includes logging setup to observe failover behavior

Co-authored-by: openhands <openhands@all-hands.dev>
github-actions bot commented Nov 7, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| ---- | ----- | ---- | ----- | ------- |
| openhands-sdk/openhands/sdk/llm/router/impl/fallback.py | 46 | 26 | 43% | 43–46, 52–54, 61, 75, 77–78, 80–81, 85, 87, 95, 99, 101–103, 108–109, 112, 114, 120–121 |
| TOTAL | 11831 | 5471 | 53% | |

- Change llms parameter from dictionary to list for simpler API
- Models are tried in list order (similar to litellm's approach)
- Internally converts list to dict for base class compatibility
- Update validator to check for empty list instead of missing 'primary' key
- Update logging to show model index (1/N, 2/N, etc.)
- Update example and tests to use new list-based API
- Update documentation to reflect list-based approach

This makes the API more intuitive and consistent with litellm's pattern.
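The internal list-to-dict conversion described in this commit could look roughly like the sketch below (the key format is an assumption for illustration, not the actual internal naming):

```python
# Hypothetical sketch of converting the list-based API to the dict the
# base class expects. Key names are illustrative assumptions.
def llms_list_to_dict(llms):
    if not llms:
        raise ValueError("llms must contain at least one model")
    # dicts preserve insertion order, so models are tried as 1/N, 2/N, ...
    return {f"model_{i}": llm for i, llm in enumerate(llms)}
```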

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot commented Nov 7, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Check Documented Examples

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1103 at branch `feature/llm-fallback-router`

Feel free to include any additional details that might help me get this PR into a better state.

