
Conversation

Collaborator

@xingyaoww xingyaoww commented Nov 11, 2025

Fix the max output error we see here:
#1016 (comment)

Fix prompt caching stats:

[screenshot: prompt caching stats]

Fix extended thinking, which requires us to send back reasoning content: https://platform.moonshot.ai/docs/guide/use-kimi-k2-thinking-model#multi-step-tool-call
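
For context, a minimal sketch of the round-trip the Moonshot docs describe, written against an OpenAI-compatible client. This is not the SDK's implementation; the endpoint, key, and tool schema are illustrative stand-ins.

```python
from openai import OpenAI

# Hypothetical endpoint and key; Moonshot exposes an OpenAI-compatible API.
client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="sk-...")

# Minimal illustrative tool schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
resp = client.chat.completions.create(
    model="kimi-k2-thinking", messages=messages, tools=tools
)
assistant = resp.choices[0].message

# The step this PR adds to the SDK: echo the assistant turn back *with* its
# reasoning_content (a Moonshot extension field), otherwise the next request
# in the multi-step tool-call loop is rejected.
messages.append(
    {
        "role": "assistant",
        "content": assistant.content,
        "reasoning_content": getattr(assistant, "reasoning_content", None),
        "tool_calls": [tc.model_dump() for tc in assistant.tool_calls or []],
    }
)
for call in assistant.tool_calls or []:
    messages.append(
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": '{"temp_c": 18}',  # stand-in tool result
        }
    )
resp = client.chat.completions.create(
    model="kimi-k2-thinking", messages=messages, tools=tools
)
```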


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:075e239-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-075e239-python \
  ghcr.io/openhands/agent-server:075e239-python

All tags pushed for this build

ghcr.io/openhands/agent-server:075e239-golang-amd64
ghcr.io/openhands/agent-server:075e239-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:075e239-golang-arm64
ghcr.io/openhands/agent-server:075e239-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:075e239-java-amd64
ghcr.io/openhands/agent-server:075e239-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:075e239-java-arm64
ghcr.io/openhands/agent-server:075e239-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:075e239-python-amd64
ghcr.io/openhands/agent-server:075e239-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:075e239-python-arm64
ghcr.io/openhands/agent-server:075e239-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:075e239-golang
ghcr.io/openhands/agent-server:075e239-java
ghcr.io/openhands/agent-server:075e239-python

About Multi-Architecture Support

  • Each variant tag (e.g., 075e239-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 075e239-python-amd64) are also available if needed

@xingyaoww xingyaoww requested a review from enyst November 11, 2025 16:17
@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Nov 11, 2025
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Contributor

@neubig neubig left a comment


LGTM, but could we create a test to make sure that reasoning content is indeed included?

- Test that kimi-k2-thinking model has send_reasoning_content=True
- Test that reasoning_content is included when send_reasoning_content is True
- Test that reasoning_content is excluded when send_reasoning_content is False
- Test edge cases: None, empty string, and list serializer scenarios

Co-authored-by: openhands <openhands@all-hands.dev>
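
A rough sketch of what such tests could look like, under the assumption that model features and message serialization are exposed roughly as below; the names mirror the checklist, but the actual SDK API may differ.

```python
# Hypothetical test sketch; the import path, fixture, and method names are
# assumptions based on the checklist above, not the SDK's actual API.
from openhands.sdk.llm.utils.model_features import get_features  # assumed


def test_kimi_k2_thinking_enables_send_reasoning_content():
    features = get_features("moonshot/kimi-k2-thinking")
    assert features.send_reasoning_content is True


def test_reasoning_content_round_trip(make_message):  # assumed fixture
    msg = make_message(reasoning_content="step-by-step thoughts")
    included = msg.to_llm_dict(send_reasoning_content=True)  # assumed method
    excluded = msg.to_llm_dict(send_reasoning_content=False)
    assert included.get("reasoning_content") == "step-by-step thoughts"
    assert "reasoning_content" not in excluded
```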
@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 90.6%
Total Cost: $1.27
Models Tested: 4
Timestamp: 2025-11-11 16:30:56 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Skipped | Total Tests | Cost |
| --- | --- | --- | --- | --- | --- |
| litellm_proxy_moonshot_kimi_k2_thinking | 87.5% | 7/8 | 1 | 8 | $0.86 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 87.5% | 7/8 | 0 | 8 | $0.04 |
| litellm_proxy_deepseek_deepseek_chat | 87.5% | 7/8 | 1 | 8 | $0.02 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 8/8 | 0 | 8 | $0.35 |

📋 Detailed Results

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.86
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_8971752_kimi_k2_run_N8_20251111_161810
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.04
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_8971752_gpt5_mini_run_N8_20251111_161809

Failed Tests:

  • t08_image_file_viewing: Agent did not identify yellow color in the logo. Response: i can view the image, but i can’t display images inline here. please either (1) allow me to analyze the image pixel colors and report a precise breakdown, or (2) tell me which level of detail you want (dominant color, top three colors, or exact rgb counts). which would you like? (Cost: $0.0013)

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.02
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_8971752_deepseek_run_N8_20251111_161803
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.35
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_8971752_sonnet_run_N8_20251111_161810


-temperature: float | None = Field(default=0.0, ge=0)
+temperature: float | None = Field(default=1.0, ge=0)
Collaborator


I saw this in their communications, but I... don't think 1 as default works so well for us, for the majority of LLMs?

Since it has reasoning, maybe we could use this code or around there:

Collaborator

@enyst enyst Nov 11, 2025


It's a bit ugly bit of code, but at least the comment says it all: "Reasoning-model quirks" 😅

Collaborator Author

@xingyaoww xingyaoww Nov 11, 2025


It is a bit annoying because kimi-k2 doesn't support reasoning effort, so we can't reuse the same path there 😅 ...

The only ways around would be to hard-code it in the code (ugly), or to set temperature to 1 for all models.

Collaborator


Oof, I see! One thought there: we eval at 0, and the user will not get the same experience we evaled for... well, by default in the SDK.

I... actually suggest hardcoding it. I read the code again, and we also check for model names containing "azure" (of course...), and for "mistral" and more in that method.

Not a strong preference of course, but I kinda want to know why a thing happened, and having them all in one place helps with that.
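
For reference, a sketch of the kind of single-table lookup this could converge on; the pattern list and values are inferred from this thread, not copied from the merged code.

```python
# Illustrative sketch: keep every model-specific temperature quirk in one table.
MODEL_DEFAULT_TEMPERATURE: list[tuple[str, float]] = [
    # Moonshot recommends temperature=1 for kimi-k2-thinking.
    ("kimi-k2-thinking", 1.0),
]


def get_default_temperature(model: str) -> float:
    """Return the default sampling temperature for a model, falling back to 0.0."""
    name = model.lower()  # case-insensitive matching
    for pattern, temperature in MODEL_DEFAULT_TEMPERATURE:
        if pattern in name:  # substring match also covers provider prefixes
            return temperature
    return 0.0
```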

Collaborator


@OpenHands we're getting too many hardcoded cases in the method linked above, in chat_options.py. Think deeply and come up with 3 suggestions for how we can improve this.

Do NOT modify this PR, just think and make a new issue with your proposals.

Updated test expectations to match the new default temperature of 1.0:
- test_llm_config_defaults: Update default temperature assertion
- test_llm_local_detection_based_on_model_name: Update temperature check
- test_no_response_retry_bumps_temperature: Explicitly set temperature=0.0
  in fixture to properly test the temperature bump behavior on retry

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Contributor

github-actions bot commented Nov 11, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| openhands-sdk/openhands/sdk/llm/llm.py | 442 | 176 | 60% | 316, 320, 325, 329, 333, 337–339, 343–344, 355–356, 358–359, 363, 380, 409, 414, 418, 423, 443, 472, 493, 497, 512, 518–519, 538–539, 549, 574–579, 600–601, 604, 608, 620, 625–628, 635, 638, 646–651, 655–658, 660, 673, 677–679, 681–682, 687–688, 690, 697, 700–705, 773–776, 781–785, 787–792, 794–796, 805–807, 810–811, 857–858, 861–864, 905, 919, 969, 972–974, 977–985, 989–991, 994, 997–999, 1006–1007, 1016, 1023–1025, 1029, 1031–1036, 1038–1055, 1058–1062, 1064–1065, 1071–1080, 1093, 1107, 1112 |
| openhands-sdk/openhands/sdk/llm/message.py | 275 | 120 | 56% | 43, 48, 50, 67–69, 71–74, 76, 97, 99, 104, 177, 181, 198–203, 253, 260, 263, 280, 297, 303, 306, 309, 323–324, 340, 361, 366–368, 374–375, 381, 383, 393, 402–408, 422, 424–425, 427–434, 437, 450, 452, 454–459, 467–468, 470–471, 481–482, 486–490, 493–496, 498–499, 502, 507–509, 516, 518, 535, 549, 565, 590–592, 594–595, 599–602, 606–608, 611–615, 617–619, 621, 629–630, 647–648 |
| openhands-sdk/openhands/sdk/llm/utils/model_features.py | 32 | 1 | 96% | 139 |
| openhands-sdk/openhands/sdk/llm/utils/telemetry.py | 136 | 57 | 58% | 87, 93, 102, 106, 109, 114–115, 150, 154, 161, 183, 187–188, 196–198, 204, 221–223, 225–227, 229, 237–238, 242–244, 247–250, 255, 260, 263, 268, 271, 277, 283, 285, 288–289, 294, 299–303, 310–311, 314, 316, 319–322 |
| TOTAL | 11982 | 5535 | 53% | |

@xingyaoww
Collaborator Author

@OpenHands please fix tests

k/openhands/sdk/workspace/remote/remote_workspace_mixin.py 105 83 21% 31-32, 36-39, 60-144, 168-205, 228-263, 286-295, 313-321
openhands-sdk/openhands/sdk/workspace/workspace.py 11 3 73% 43-49

TOTAL 6715 4260 37%
=========================== short test summary info ============================
FAILED tests/sdk/config/test_llm_config.py::test_llm_config_defaults
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_should_condense
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_condense_returns_view_when_no_condensation_needed
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_condense_returns_condensation_when_needed
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_get_condensation_with_previous_summary
FAILED tests/sdk/context/condenser/test_llm_summarizing_condenser.py::test_invalid_config
FAILED tests/sdk/llm/test_llm.py::test_llm_local_detection_based_on_model_name
================== 7 failed, 1123 passed in 150.43s (0:02:30) ==================

Do not change the implementation. Also add new tests for the default temperature function that was just added.

@openhands-ai

openhands-ai bot commented Nov 11, 2025

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Nov 11, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run tests

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1133 at branch `xw/fix-kimi`

Feel free to include any additional details that might help me get this PR into a better state.


- Update test_llm_config_defaults to expect temperature=0.0
- Update test_llm_local_detection_based_on_model_name to expect temperature=0.0
- Fix mock_llm fixture in condenser tests to include temperature attribute
- Add comprehensive tests for get_default_temperature function:
  * Test kimi-k2-thinking models default to 1.0
  * Test all other models default to 0.0
  * Test with provider prefixes (moonshot/, litellm_proxy/)
  * Test case insensitivity
  * Test fallback for unknown models

All previously failing tests now pass.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai

openhands-ai bot commented Nov 11, 2025

✅ Tests Fixed Successfully

I've successfully fixed all 7 failing tests and added comprehensive new tests for the get_default_temperature function. Here's what was done:

Test Fixes (No Implementation Changes)

  1. test_llm_config_defaults - Updated expected temperature from 1.0 to 0.0
  2. test_llm_local_detection_based_on_model_name - Updated expected temperature from 1.0 to 0.0
  3. test_llm_summarizing_condenser.py (5 tests) - Added mock_llm.temperature = 0.0 attribute to the mock fixture

New Tests Added for get_default_temperature

Added 24 new test cases in tests/sdk/llm/test_model_features.py:

  • test_get_default_temperature (22 parametrized cases):

    • ✅ kimi-k2-thinking models default to 1.0
    • ✅ All other models default to 0.0
    • ✅ Provider prefixes (moonshot/, litellm_proxy/, azure/, openai/, anthropic/)
    • ✅ Case insensitivity
    • ✅ Various model families (GPT, Claude, Gemini, O1, Deepseek, Llama)
  • test_get_default_temperature_fallback - Tests unknown models return 0.0

  • test_get_default_temperature_case_insensitive - Additional case insensitivity tests
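
A parametrized sketch of those cases; the import path follows the model_features.py location in the coverage report above, but is still an assumption.

```python
import pytest

from openhands.sdk.llm.utils.model_features import get_default_temperature  # assumed


@pytest.mark.parametrize(
    ("model", "expected"),
    [
        ("kimi-k2-thinking", 1.0),
        ("moonshot/kimi-k2-thinking", 1.0),
        ("litellm_proxy/moonshot/kimi-k2-thinking", 1.0),
        ("KIMI-K2-THINKING", 1.0),  # case insensitivity
        ("gpt-5-mini", 0.0),
        ("claude-sonnet-4-5", 0.0),
        ("totally-unknown-model", 0.0),  # fallback
    ],
)
def test_get_default_temperature(model: str, expected: float) -> None:
    assert get_default_temperature(model) == expected
```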

Verification

✅ All 7 originally failing tests now pass
✅ All 24 new tests pass
✅ Pre-commit hooks pass (formatting, linting, type checking)
✅ Changes committed and pushed to branch xw/fix-kimi

The PR is now ready with all tests passing!


@xingyaoww xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Nov 11, 2025
@github-actions
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Contributor

🧪 Integration Tests Results

Overall Success Rate: 93.8%
Total Cost: $0.59
Models Tested: 4
Timestamp: 2025-11-11 17:20:46 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Success Rate | Tests Passed | Skipped | Total Tests | Cost |
| --- | --- | --- | --- | --- | --- |
| litellm_proxy_deepseek_deepseek_chat | 87.5% | 7/8 | 1 | 8 | $0.03 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 8/8 | 0 | 8 | $0.38 |
| litellm_proxy_gpt_5_mini_2025_08_07 | 100.0% | 8/8 | 0 | 8 | $0.04 |
| litellm_proxy_moonshot_kimi_k2_thinking | 87.5% | 7/8 | 1 | 8 | $0.14 |

📋 Detailed Results

litellm_proxy_deepseek_deepseek_chat

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.03
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_e158c9b_deepseek_run_N8_20251111_171825
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.38
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_e158c9b_sonnet_run_N8_20251111_171824

litellm_proxy_gpt_5_mini_2025_08_07

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.04
  • Run Suffix: litellm_proxy_gpt_5_mini_2025_08_07_e158c9b_gpt5_mini_run_N8_20251111_171826

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.14
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_e158c9b_kimi_k2_run_N8_20251111_171825
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@xingyaoww xingyaoww enabled auto-merge (squash) November 11, 2025 17:23
@xingyaoww xingyaoww merged commit a4f97bd into main Nov 11, 2025
15 checks passed