
Conversation

Collaborator

@xingyaoww xingyaoww commented Nov 11, 2025

Fix the max output error we see here:
#1016 (comment)

Fix prompt caching stats:

[screenshot: prompt caching stats]

Fix extended thinking, which requires us to send back reasoning content: https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model#multi-step-tool-call
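
For context, a minimal sketch of the round-trip the Moonshot docs describe, written against an OpenAI-compatible client. This is not the SDK's implementation; the endpoint, key, and tool schema are illustrative stand-ins.

```python
from openai import OpenAI

# Hypothetical endpoint and key; Moonshot exposes an OpenAI-compatible API.
client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="sk-...")

# Minimal illustrative tool schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
resp = client.chat.completions.create(
    model="kimi-k2-thinking", messages=messages, tools=tools
)
assistant = resp.choices[0].message

# The step this PR adds to the SDK: echo the assistant turn back *with* its
# reasoning_content (a Moonshot extension field), otherwise the next request
# in the multi-step tool-call loop is rejected.
messages.append(
    {
        "role": "assistant",
        "content": assistant.content,
        "reasoning_content": getattr(assistant, "reasoning_content", None),
        "tool_calls": [tc.model_dump() for tc in assistant.tool_calls or []],
    }
)
for call in assistant.tool_calls or []:
    messages.append(
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"temp_c": 18}',  # stand-in tool result
        }
    )
resp = client.chat.completions.create(
    model="kimi-k2-thinking", messages=messages, tools=tools
)
```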


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:075e239-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-075e239-python \
  ghcr.io/openhands/agent-server:075e239-python

All tags pushed for this build

ghcr.io/openhands/agent-server:075e239-golang-amd64
ghcr.io/openhands/agent-server:075e239-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:075e239-golang-arm64
ghcr.io/openhands/agent-server:075e239-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:075e239-java-amd64
ghcr.io/openhands/agent-server:075e239-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:075e239-java-arm64
ghcr.io/openhands/agent-server:075e239-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:075e239-python-amd64
ghcr.io/openhands/agent-server:075e239-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:075e239-python-arm64
ghcr.io/openhands/agent-server:075e239-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:075e239-golang
ghcr.io/openhands/agent-server:075e239-java
ghcr.io/openhands/agent-server:075e239-python

About Multi-Architecture Support

  • Each variant tag (e.g., 075e239-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 075e239-python-amd64) are also available if needed

@xingyaoww xingyaoww requested a review from enyst November 11, 2025 16:17
@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Nov 11, 2025
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Contributor

@neubig neubig left a comment


LGTM, but could we create a test to make sure that reasoning content is indeed included?

- Test that kimi-k2-thinking model has send_reasoning_content=True
- Test that reasoning_content is included when send_reasoning_content is True
- Test that reasoning_content is excluded when send_reasoning_content is False
- Test edge cases: None, empty string, and list serializer scenarios

Co-authored-by: openhands <openhands@all-hands.dev>
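
A rough sketch of what such tests could look like, under the assumption that model features and message serialization are exposed roughly as below; the names mirror the checklist, but the actual SDK API may differ.

```python
# Hypothetical test sketch; the import path, fixture, and method names are
# assumptions based on the checklist above, not the SDK's actual API.
from openhands.sdk.llm.utils.model_features import get_features  # assumed


def test_kimi_k2_thinking_enables_send_reasoning_content():
    features = get_features("moonshot/kimi-k2-thinking")
    assert features.send_reasoning_content is True


def test_reasoning_content_round_trip(make_message):  # assumed fixture
    msg = make_message(reasoning_content="step-by-step thoughts")
    included = msg.to_llm_dict(send_reasoning_content=True)  # assumed method
    excluded = msg.to_llm_dict(send_reasoning_content=False)
    assert included.get("reasoning_content") == "step-by-step thoughts"
    assert "reasoning_content" not in excluded
```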
@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 90.6%
Total Cost: $1.27
Models Tested: 4
Timestamp: 2025-11-11 16:30:56 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Skipped | Total Tests | Cost |
| --- | --- | --- | --- | --- | --- |
| litellm_proxy_moonshot_kimi_k2_thinking | 87.5% | 7/8 | 1 | 8 | $0.86 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 87.5% | 7/8 | 0 | 8 | $0.04 |
| litellm_proxy_deepseek_deepseek_chat | 87.5% | 7/8 | 1 | 8 | $0.02 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 8/8 | 0 | 8 | $0.35 |

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.86
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_8971752_kimi_k2_run_N8_20251111_161810
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.04
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_8971752_gpt5_mini_run_N8_20251111_161809

Failed Tests:

  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: i can view the image, but i can’t display images inline here. please either (1) allow me to analyze the image pixel colors and report a precise breakdown, or (2) tell me which level of detail you want (dominant color, top three colors, or exact rgb counts). which would you like? (Cost: $0.0013)

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.02
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_8971752_deepseek_run_N8_20251111_161803
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.35
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_8971752_sonnet_run_N8_20251111_161810


-temperature: float | None = Field(default=0.0, ge=0)
+temperature: float | None = Field(default=1.0, ge=0)
Collaborator


I saw this in their communications, but I... don't think 1 as default works so well for us, for the majority of LLMs?

Since it has reasoning, maybe we could use this code or around there:

Collaborator

@enyst enyst Nov 11, 2025


It's a bit ugly bit of code, but at least the comment says it all: "Reasoning-model quirks" 😅

Collaborator Author

@xingyaoww xingyaoww Nov 11, 2025


It is a bit annoying because kimi-k2 doesn't support reasoning effort, so we can't reuse the same path there 😅 ...

The only ways around would be to hard-code it in the code (ugly), or to set temperature to 1 for all models.

Collaborator


Oof, I see! One thought there: we eval at 0, and the user will not get the same experience we evaled for... well, by default in the SDK.

I... actually suggest hardcoding it. I read the code again, and we also check for model names containing "azure" (of course...), and for "mistral" and more in that method.

Not a strong preference of course, but I kinda want to know why a thing happened, and having them all in one place helps with that.
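
For reference, a sketch of the kind of single-table lookup this could converge on; the pattern list and values are inferred from this thread, not copied from the merged code.

```python
# Illustrative sketch: keep every model-specific temperature quirk in one table.
MODEL_DEFAULT_TEMPERATURE: list[tuple[str, float]] = [
    # Moonshot recommends temperature=1 for kimi-k2-thinking.
    ("kimi-k2-thinking", 1.0),
]


def get_default_temperature(model: str) -> float:
    """Return the default sampling temperature for a model, falling back to 0.0."""
    name = model.lower()  # case-insensitive matching
    for pattern, temperature in MODEL_DEFAULT_TEMPERATURE:
        if pattern in name:  # substring match also covers provider prefixes
            return temperature
    return 0.0
```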

Collaborator


@OpenHands we're getting too many hardcoded cases in the method linked above, in chat_options.py. Think deeply and come up with 3 suggestions for how we can improve this.

Do NOT modify this PR, just think and make a new issue with your proposals.

Updated test expectations to match the new default temperature of 1.0:
- test_llm_config_defaults: Update default temperature assertion
- test_llm_local_detection_based_on_model_name: Update temperature check
- test_no_response_retry_bumps_temperature: Explicitly set temperature=0.0
  in fixture to properly test the temperature bump behavior on retry

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Contributor

github-actions bot commented Nov 11, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| openhands-sdk/openhands/sdk/llm/llm.py | 442 | 176 | 60% | 316, 320, 325, 329, 333, 337–339, 343–344, 355–356, 358–359, 363, 380, 409, 414, 418, 423, 443, 472, 493, 497, 512, 518–519, 538–539, 549, 574–579, 600–601, 604, 608, 620, 625–628, 635, 638, 646–651, 655–658, 660, 673, 677–679, 681–682, 687–688, 690, 697, 700–705, 773–776, 781–785, 787–792, 794–796, 805–807, 810–811, 857–858, 861–864, 905, 919, 969, 972–974, 977–985, 989–991, 994, 997–999, 1006–1007, 1016, 1023–1025, 1029, 1031–1036, 1038–1055, 1058–1062, 1064–1065, 1071–1080, 1093, 1107, 1112 |
| openhands-sdk/openhands/sdk/llm/message.py | 275 | 120 | 56% | 43, 48, 50, 67–69, 71–74, 76, 97, 99, 104, 177, 181, 198–203, 253, 260, 263, 280, 297, 303, 306, 309, 323–324, 340, 361, 366–368, 374–375, 381, 383, 393, 402–408, 422, 424–425, 427–434, 437, 450, 452, 454–459, 467–468, 470–471, 481–482, 486–490, 493–496, 498–499, 502, 507–509, 516, 518, 535, 549, 565, 590–592, 594–595, 599–602, 606–608, 611–615, 617–619, 621, 629–630, 647–648 |
| openhands-sdk/openhands/sdk/llm/utils/model_features.py | 32 | 1 | 96% | 139 |
| openhands-sdk/openhands/sdk/llm/utils/telemetry.py | 136 | 57 | 58% | 87, 93, 102, 106, 109, 114–115, 150, 154, 161, 183, 187–188, 196–198, 204, 221–223, 225–227, 229, 237–238, 242–244, 247–250, 255, 260, 263, 268, 271, 277, 283, 285, 288–289, 294, 299–303, 310–311, 314, 316, 319–322 |
| TOTAL | 11982 | 5535 | 53% | |

@xingyaoww
Collaborator Author

@OpenHands please fix tests

k/openhands/sdk/workspace/remote/remote_workspace_mixin.py 105 83 21% 31-32, 36-39, 60-144, 168-205, 228-263, 286-295, 313-321
openhands-sdk/openhands/sdk/workspace/workspace.py 11 3 73% 43-49

TOTAL 6715 4260 37%
=========================== short test summary info ============================
FAILED tests/sdk/config/test_llm_config.py::test_llm_config_defaults
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_should_condense
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_condense_returns_view_when_no_condensation_needed
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_condense_returns_condensation_when_needed
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_get_condensation_with_previous_summary
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_invalid_config
FAILED tests/sdk/llm/test_llm.py::test_llm_local_detection_based_on_model_name
================== 7 failed, 1123 passed in 150.43s (0:02:30) ==================

Do not change the implementation. Also add new tests for the default temperature function that was just added.

@openhands-ai

openhands-ai bot commented Nov 11, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Nov 11, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1133 at branch `xw/fix-kimi`

Feel free to include any additional details that might help me get this PR into a better state.


- Update test_llm_config_defaults to expect temperature=0.0
- Update test_llm_local_detection_based_on_model_name to expect temperature=0.0
- Fix mock_llm fixture in condenser tests to include temperature attribute
- Add comprehensive tests for get_default_temperature function:
  * Test kimi-k2-thinking models default to 1.0
  * Test all other models default to 0.0
  * Test with provider prefixes (moonshot/, litellm_proxy/)
  * Test case insensitivity
  * Test fallback for unknown models

All previously failing tests now pass.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Nov 11, 2025

✅ Tests Fixed Successfully

I've successfully fixed all 7 failing tests and added comprehensive new tests for the get_default_temperature function. Here's what was done:

Test Fixes (No Implementation Changes)

  1. test_llm_config_defaults - Updated expected temperature from 1.0 to 0.0
  2. test_llm_local_detection_based_on_model_name - Updated expected temperature from 1.0 to 0.0
  3. test_llm_summarizing_condenser.py (5 tests) - Added mock_llm.temperature = 0.0 attribute to the mock fixture

New Tests Added for get_default_temperature

Added 24 new test cases in tests/sdk/llm/test_model_features.py:

  • test_get_default_temperature (22 parametrized cases):

    • ✅ kimi-k2-thinking models default to 1.0
    • ✅ All other models default to 0.0
    • ✅ Provider prefixes (moonshot/, litellm_proxy/, azure/, openai/, anthropic/)
    • ✅ Case insensitivity
    • ✅ Various model families (GPT, Claude, Gemini, O1, Deepseek, Llama)
  • test_get_default_temperature_fallback - Tests unknown models return 0.0

  • test_get_default_temperature_case_insensitive - Additional case insensitivity tests
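
A parametrized sketch of those cases; the import path follows the model_features.py location in the coverage report above, but is still an assumption.

```python
import pytest

from openhands.sdk.llm.utils.model_features import get_default_temperature  # assumed


@pytest.mark.parametrize(
    ("model", "expected"),
    [
        ("kimi-k2-thinking", 1.0),
        ("moonshot/kimi-k2-thinking", 1.0),
        ("litellm_proxy/moonshot/kimi-k2-thinking", 1.0),
        ("KIMI-K2-THINKING", 1.0),  # case insensitivity
        ("gpt-5-mini", 0.0),
        ("claude-sonnet-4-5", 0.0),
        ("totally-unknown-model", 0.0),  # fallback
    ],
)
def test_get_default_temperature(model: str, expected: float) -> None:
    assert get_default_temperature(model) == expected
```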

Verification

✅ All 7 originally failing tests now pass
✅ All 24 new tests pass
✅ Pre-commit hooks pass (formatting, linting, type checking)
✅ Changes committed and pushed to branch xw/fix-kimi

The PR is now ready with all tests passing!


@xingyaoww xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Nov 11, 2025
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 93.8%
Total Cost: $0.59
Models Tested: 4
Timestamp: 2025-11-11 17:20:46 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Skipped | Total Tests | Cost |
| --- | --- | --- | --- | --- | --- |
| litellm_proxy_deepseek_deepseek_chat | 87.5% | 7/8 | 1 | 8 | $0.03 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 8/8 | 0 | 8 | $0.38 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 100.0% | 8/8 | 0 | 8 | $0.04 |
| litellm_proxy_moonshot_kimi_k2_thinking | 87.5% | 7/8 | 1 | 8 | $0.14 |

📋 Detailed Results

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.03
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_e158c9b_deepseek_run_N8_20251111_171825
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.38
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_e158c9b_sonnet_run_N8_20251111_171824

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.04
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_e158c9b_gpt5_mini_run_N8_20251111_171826

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.14
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_e158c9b_kimi_k2_run_N8_20251111_171825
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@xingyaoww xingyaoww enabled auto-merge (squash) November 11, 2025 17:23
@xingyaoww xingyaoww merged commit a4f97bd into main Nov 11, 2025
15 checks passed