Add FallbackRouter for LLM failover support #1103
Draft
Summary
This PR implements a FallbackRouter that provides automatic failover between multiple LLM models when the primary model fails. This addresses the need for robust LLM usage with automatic fallback capabilities.
Related PRs
Key Features
Implementation Details
New Components
FallbackRouter Class (openhands/sdk/llm/router/impl/fallback.py)
• RouterLLM base class
• completion() method to implement fallback logic
• 'primary' key exists in llms_for_routing
• active_llm for proper telemetry delegation
Comprehensive Tests (tests/sdk/llm/test_fallback_router.py)
Usage Example (examples/01_standalone_sdk/27_llm_fallback.py)
How It Works
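The failover behaviour this PR describes (try the primary model first, move on to the next model when a call raises, and re-raise the last exception only if every model fails) can be sketched roughly as follows. This is a standalone illustration, not the code in fallback.py: FallbackRouterSketch, active_llm_name, and the plain-callable stand-ins for LLM clients are hypothetical, and the assumptions that fallbacks are tried in dictionary insertion order and that the active model is recorded after a successful call are not confirmed by the PR.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for an LLM client: a callable that takes a prompt
# and returns text. The real SDK uses LLM objects, not bare callables.
Completion = Callable[[str], str]


class FallbackRouterSketch:
    """Illustrative only; not the SDK's FallbackRouter."""

    def __init__(self, llms_for_routing: Dict[str, Completion]) -> None:
        # The PR requires a "primary" key so the starting model is explicit.
        if "primary" not in llms_for_routing:
            raise ValueError("llms_for_routing must contain a 'primary' key")
        self.llms_for_routing = llms_for_routing
        self.active_llm_name = "primary"

    def _attempt_order(self) -> List[str]:
        # Assumption: primary first, then the rest in insertion order.
        others = [name for name in self.llms_for_routing if name != "primary"]
        return ["primary", *others]

    def completion(self, prompt: str) -> str:
        last_error: Optional[Exception] = None
        for name in self._attempt_order():
            try:
                result = self.llms_for_routing[name](prompt)
            except Exception as exc:  # catch everything, then try the next model
                last_error = exc
                continue
            # Assumption: record the model that actually answered.
            self.active_llm_name = name
            return result
        # Every model failed: surface the most recent failure.
        assert last_error is not None
        raise last_error


def flaky(prompt: str) -> str:
    raise RuntimeError("primary unavailable")


def backup(prompt: str) -> str:
    return f"backup: {prompt}"


router = FallbackRouterSketch({"primary": flaky, "secondary": backup})
reply = router.completion("hello")
```

Here the primary callable always raises, so the reply comes from the secondary one and active_llm_name ends up as "secondary".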
Design Decisions
RouterLLM Pattern: Implemented as a router (similar to MultimodalRouter and RandomRouter) rather than modifying the base LLM class, keeping changes modular and non-invasive.
Primary Key Requirement: Requires a "primary" key in the llms_for_routing dictionary for clear intent and consistent behavior.
Error Handling: Catches all exceptions during completion attempts and tries the next model, only raising the last exception if all models fail.
Telemetry: Delegates telemetry to the active LLM through the base RouterLLM class's __getattr__ mechanism.
Testing
All tests pass successfully:
$ uv run pytest tests/sdk/llm/test_fallback_router.py -v
========== 8 passed in 0.05s ==========
Tests include:
Related Issues
This implements fallback language model functionality as requested, which was not previously supported in the SDK.
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:ce8e169-python
Run
All tags pushed for this build
About Multi-Architecture Support
• Each variant tag (e.g. ce8e169-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. ce8e169-python-amd64) are also available if needed