Support kimi-k2 extended thinking, fix prompt caching stats, fix max output #1133
Conversation
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.
neubig left a comment:
LGTM, but could we add a test to make sure that reasoning content is indeed included?
- Test that kimi-k2-thinking model has send_reasoning_content=True
- Test that reasoning_content is included when send_reasoning_content is True
- Test that reasoning_content is excluded when send_reasoning_content is False
- Test edge cases: None, empty string, and list serializer scenarios

Co-authored-by: openhands <openhands@all-hands.dev>
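For reference, here is a minimal sketch of the kind of checks that commit describes. The serializer below is a hypothetical stand-in rather than the SDK's real message type; only the send_reasoning_content flag and the reasoning_content field come from this PR.

```python
import pytest


def serialize_assistant_message(
    content: str, reasoning_content: str | None, send_reasoning_content: bool
) -> dict:
    """Toy serializer mirroring the behavior under test."""
    out = {"role": "assistant", "content": content}
    # Only attach reasoning_content when the flag is on and there is something to send.
    if send_reasoning_content and reasoning_content:
        out["reasoning_content"] = reasoning_content
    return out


def test_reasoning_content_included_when_enabled():
    msg = serialize_assistant_message("answer", "chain of thought", True)
    assert msg["reasoning_content"] == "chain of thought"


def test_reasoning_content_excluded_when_disabled():
    msg = serialize_assistant_message("answer", "chain of thought", False)
    assert "reasoning_content" not in msg


@pytest.mark.parametrize("empty", [None, ""])
def test_reasoning_content_omitted_when_empty(empty):
    msg = serialize_assistant_message("answer", empty, True)
    assert "reasoning_content" not in msg
```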
🧪 Integration Tests Results

Overall Success Rate: 90.6%

📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

📋 Detailed Results
- litellm_proxy_moonshot_kimi_k2_thinking
  Skipped Tests:
- litellm_proxy_gpt_5_mini_2025_08_07
  Failed Tests:
- litellm_proxy_deepseek_deepseek_chat
  Skipped Tests:
- litellm_proxy_claude_sonnet_4_5_20250929
Reviewed diff (default temperature):
- temperature: float | None = Field(default=0.0, ge=0)
+ temperature: float | None = Field(default=1.0, ge=0)
I saw this in their communications, but I... don't think 1 as a default works so well for us, for the majority of LLMs?
Since it has reasoning, maybe we could use this code, or something around there:
out.pop("temperature", None)
It's a bit of ugly code, but at least the comment says it all: "Reasoning-model quirks" 😅
It is a bit annoying because kimi-k2 doesn't support reasoning effort, so we can't reuse the same path there 😅 ...
The only way around that would be to hard-code it in the code (ugly), which is why I went with setting temperature to 1 for all models.
Oof, I see! One thought there: we eval at 0, so by default in the SDK the user will not get the same experience we evaled for.
I... actually suggest hardcoding it. I read the code again, and in that method we also check for model names with "azure" (of course...), "mistral", and more.
Not a strong preference of course, but I kinda want to know why a thing happened, and having them all in one place helps with that.
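For context, a hard-coded default along those lines might look roughly like the sketch below. This is a minimal sketch based on the get_default_temperature behavior described in the test commit further down; the real function's location and signature in chat_options.py may differ.

```python
def get_default_temperature(model: str) -> float:
    """Return the default sampling temperature for a given model name.

    Sketch only: kimi-k2-thinking is matched case-insensitively and with
    provider prefixes (e.g. moonshot/, litellm_proxy/) included, as the
    tests in this PR describe; every other model keeps 0.0.
    """
    # Substring match so prefixes like "litellm_proxy/moonshot/..." still hit.
    if "kimi-k2-thinking" in model.lower():
        return 1.0
    return 0.0
```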
@OpenHands we're getting too many hardcoded cases in the method linked above, in chat_options.py. Think deeply and come up with 3 suggestions for how we can improve this.
Do NOT modify this PR, just think and make a new issue with your proposals.
Updated test expectations to match the new default temperature of 1.0:
- test_llm_config_defaults: Update default temperature assertion
- test_llm_local_detection_based_on_model_name: Update temperature check
- test_no_response_retry_bumps_temperature: Explicitly set temperature=0.0 in fixture to properly test the temperature bump behavior on retry

Co-authored-by: openhands <openhands@all-hands.dev>
Coverage Report
k/openhands/sdk/workspace/remote/remote_workspace_mixin.py: 105 stmts, 83 missed, 21% covered (missing: 31-32, 36-39, 60-144, 168-205, 228-263, 286-295, 313-321)

@OpenHands please fix tests
I'm on it! xingyaoww can track my progress at all-hands.dev
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
- Update test_llm_config_defaults to expect temperature=0.0
- Update test_llm_local_detection_based_on_model_name to expect temperature=0.0
- Fix mock_llm fixture in condenser tests to include temperature attribute
- Add comprehensive tests for get_default_temperature function:
  * Test kimi-k2-thinking models default to 1.0
  * Test all other models default to 0.0
  * Test with provider prefixes (moonshot/, litellm_proxy/)
  * Test case insensitivity
  * Test fallback for unknown models

All previously failing tests now pass.

Co-authored-by: openhands <openhands@all-hands.dev>
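A condensed sketch of what those get_default_temperature checks could look like. The local stand-in mirrors the sketch earlier in this thread; in the real test suite the function would be imported from the SDK (the exact module path is not shown here).

```python
import pytest


def get_default_temperature(model: str) -> float:
    # Local stand-in for the SDK function under test; swap in the real import.
    return 1.0 if "kimi-k2-thinking" in model.lower() else 0.0


@pytest.mark.parametrize(
    "model",
    [
        "kimi-k2-thinking",
        "moonshot/kimi-k2-thinking",
        "litellm_proxy/moonshot/kimi-k2-thinking",
        "Kimi-K2-Thinking",  # case insensitivity
    ],
)
def test_kimi_k2_thinking_defaults_to_one(model):
    assert get_default_temperature(model) == 1.0


@pytest.mark.parametrize(
    "model",
    ["gpt-5-mini-2025-08-07", "claude-sonnet-4-5", "some-unknown-model"],
)
def test_other_models_default_to_zero(model):
    assert get_default_temperature(model) == 0.0
```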
✅ Tests Fixed Successfully
I've successfully fixed all 7 failing tests and added comprehensive new tests for the get_default_temperature function.
Test Fixes (No Implementation Changes)
New Tests Added for get_default_temperature
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

🧪 Integration Tests Results

Overall Success Rate: 93.8%

📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

📋 Detailed Results
- litellm_proxy_deepseek_deepseek_chat
  Skipped Tests:
- litellm_proxy_claude_sonnet_4_5_20250929
- litellm_proxy_gpt_5_mini_2025_08_07
- litellm_proxy_moonshot_kimi_k2_thinking
  Skipped Tests:
This PR:
- Fixes the max output error we see here: #1016 (comment)
- Fixes prompt caching stats
- Fixes extended thinking, which requires us to send reasoning content back to the model (see the sketch below): https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model#multi-step-tool-call
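To illustrate what sending reasoning content back means in practice, here is a minimal sketch of the message list for a multi-step tool call, roughly following the linked Moonshot guide. The field names use the OpenAI-compatible chat format (as spoken via LiteLLM); the payload values are purely illustrative.

```python
# Follow-up request in a multi-step tool call with kimi-k2-thinking:
# the previous assistant turn is replayed with its reasoning_content
# intact so the model can continue its chain of thought.
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": "",
        # Echoed back verbatim from the model's previous response.
        "reasoning_content": "I should call the weather tool first.",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "get_weather", "arguments": '{"city": "Tokyo"}'},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 18}'},
]
```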
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:075e239-python

Run

All tags pushed for this build

About Multi-Architecture Support
- Each variant tag (e.g. 075e239-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. 075e239-python-amd64) are also available if needed