
Conversation

@bbrowning (Contributor) commented Nov 14, 2025

Purpose

This fixes multiple issues with how we were prompting gpt-oss models for Chat Completion requests. A summary of the fixes:

  • We were not setting the recipient to "assistant" on tool responses.
  • We were not mapping the tool_call_id to the proper function name on tool responses, so the model got confused about which tool response matched which tool call.
  • When the model generates tool calls during a turn, any model output to the commentary channel is returned as the content of that tool call response. We were not properly restoring this content as a commentary message when those tool calls were passed back in via Chat Completion messages.
  • When the model generates reasoning messages during a tool call turn, that is returned as the reasoning property on the Chat Completion response. We were not properly restoring that reasoning content as a message to the analysis channel when subsequent turns passed it back in.
  • When non-tool reasoning responses from the model were passed back in via the reasoning extra_body in subsequent turns, we were not properly turning those back into analysis messages. This is the same problem we had with reasoning in tool calls, but with a slightly different fix for the non-tool call path.
  • We were not properly marking the channel as final for non-tool, non-reasoning model output when generating the Harmony messages for subsequent turns.
  • We were not dropping analysis messages that precede a model output to the final channel in all cases, so the model was often shown analysis messages from previous turns in which it had already generated a final output, leading to increased token usage, context pollution, and deviation from the best practices for handling raw chain of thought outlined by OpenAI.

I added a series of tests that pass in Chat Completion requests and thoroughly validate the exact Harmony messages generated from those requests to test for all of the cases above.
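
As a rough illustration of the first two fixes above, a minimal sketch of rendering a Chat Completion tool message as a Harmony message (this is not the PR's code; it assumes openai_harmony's Author/Message API, and tool_call_id_to_name is a hypothetical lookup built from the preceding assistant tool_calls):

# Sketch only: map a Chat Completion `tool` message to a Harmony message.
from openai_harmony import Author, Message, Role


def tool_message_to_harmony(
    tool_msg: dict, tool_call_id_to_name: dict[str, str]
) -> Message:
    # Look up which function this tool output answers (second fix above).
    name = tool_call_id_to_name[tool_msg["tool_call_id"]]
    msg = Message.from_author_and_content(
        Author.new(Role.TOOL, f"functions.{name}"), tool_msg["content"]
    )
    # Tool output flows on the commentary channel, addressed back to the
    # assistant (first fix above).
    return msg.with_channel("commentary").with_recipient("assistant")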

Test Plan

Unit Tests

pytest tests/entrypoints/openai/test_serving_chat.py::TestServingChatWithHarmony

pytest tests/entrypoints/test_harmony_utils.py

Berkeley Function Calling Leaderboard v4 Results

Run vLLM built with this fix:

vllm serve openai/gpt-oss-120b \
  --tool-call-parser openai \
  --reasoning-parser openai_gptoss \
  --enable-auto-tool-choice \
  --tensor-parallel-size 4

Run bfcl generate against that vLLM:

# Assumes you're inside a local clone of bfcl and adjusted it to use openai/gpt-oss-120b model
cat <<EOF >> bfcl_eval/constants/model_config.py
MODEL_CONFIG_MAPPING = {
    "openai/gpt-oss-120b": ModelConfig(
        model_name="openai/gpt-oss-120b",
        display_name="openai/gpt-oss-120b (FC) (vLLM)",
        url="https://huggingface.co/openai/gpt-oss-120b",
        org="OpenAI",
        license="apache-2.0",
        model_handler=OpenAICompletionsHandler,
        input_price=None,
        output_price=None,
        is_fc_model=True,
        underscore_to_dot=True,
    ),
}
EOF

OPENAI_BASE_URL="http://localhost:8000/v1" \
OPENAI_API_KEY="fake" \
bfcl generate \
  --model openai/gpt-oss-120b \
  --test-category multi_turn \
  --num-threads 8

Test Result

Unit Tests

pytest -q --disable-warnings \
  tests/entrypoints/openai/test_serving_chat.py::TestServingChatWithHarmony
...........                                                       [100%]
11 passed, 2 warnings in 12.20s

pytest -q --disable-warnings \
  tests/entrypoints/test_harmony_utils.py
.......................................................        [100%]
55 passed, 2 warnings in 3.83s

Berkeley Function Calling Leaderboard v4 Results

Baseline, from vLLM main as of Nov 12, 2025

commit 3044195

45.0% overall accuracy

🔍 Running test: multi_turn_base
✅ Test completed: multi_turn_base. 🎯 Accuracy: 50.00%
🔍 Running test: multi_turn_long_context
✅ Test completed: multi_turn_long_context. 🎯 Accuracy: 43.00%
🔍 Running test: multi_turn_miss_func
✅ Test completed: multi_turn_miss_func. 🎯 Accuracy: 45.50%
🔍 Running test: multi_turn_miss_param
✅ Test completed: multi_turn_miss_param. 🎯 Accuracy: 41.50%

$ cat score-baseline/data_multi_turn.csv 
Rank,Model,Multi Turn Overall Acc,Base,Miss Func,Miss Param,Long Context
1,openai/gpt-oss-120b (FC) (vLLM),45.00%,50.00%,45.50%,41.50%,43.00%

With this fix on top of the same commit from main

  • 53.25% overall accuracy, an improvement of 8.25 percentage points
  • 12-point improvement in multi_turn_base
  • 8.5-point improvement in multi_turn_long_context
  • 4.5-point improvement in multi_turn_miss_func
  • 8.0-point improvement in multi_turn_miss_param

✅ Test completed: multi_turn_base. 🎯 Accuracy: 62.00%
🔍 Running test: multi_turn_long_context
✅ Test completed: multi_turn_long_context. 🎯 Accuracy: 51.50%
🔍 Running test: multi_turn_miss_func
✅ Test completed: multi_turn_miss_func. 🎯 Accuracy: 50.00%
🔍 Running test: multi_turn_miss_param
✅ Test completed: multi_turn_miss_param. 🎯 Accuracy: 49.50%

$ cat score-harmony_fixes/data_multi_turn.csv 
Rank,Model,Multi Turn Overall Acc,Base,Miss Func,Miss Param,Long Context
1,openai/gpt-oss-120b (FC) (vLLM),53.25%,62.00%,50.00%,49.50%,51.50%

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive set of fixes for prompting gpt-oss models for Chat Completion, addressing several issues related to tool calls, reasoning, and multi-turn conversations. The changes significantly improve the model's performance on function calling benchmarks. The addition of a thorough test suite to validate these fixes is a great contribution. The code is well-structured, and the fixes in vllm/entrypoints/harmony_utils.py are robust. I have one suggestion to improve the code by removing a side effect, which will enhance maintainability.

@bbrowning (Contributor, Author)

Before this fix, when running the bfcl multi_turn and single_turn suites (4,441 tests) I received 8 HarmonyErrors of "unexpected tokens remaining in message header" and 1 HarmonyError of "Unknown role: final". After this fix, I received zero HarmonyErrors on the same test suites.

So, I can also confidently say that by fixing these issues in how we were prompting the gpt-oss-120b model, we reduce how often it fails to follow its Harmony format in its responses.

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@ehfd commented Nov 15, 2025

Are we able to get this merged before v0.11.1? Looks very critical.

@ZJY0516 (Contributor) commented Nov 15, 2025

CC @chaunceyjiang

@groundsada

Can this be merged please 🙏?

@bbrowning (Contributor, Author)

I'm going to make some minor changes to address the review feedback from the two code review bots and to isolate this change entirely from the Responses API path, so that we can merge this more quickly without worrying about it impacting Responses handling in any way. I'll get that done, re-test, and push the updates today.

@bbrowning (Contributor, Author)

> Are we able to get this merged before v0.11.1? Looks very critical.

Note that even without this change, the gpt-oss support on latest merged main branch is substantially better than what v0.11.0 or v0.10.2 shipped with. I don't know the timing or cutoff for fixes in v0.11.1, but even if this does not make that release there will still be measurably better gpt-oss support in v0.11.1. With that said, I support getting this in if the window for additional fixes is still open for v0.11.1.

@bbrowning (Contributor, Author)

Ok, from my perspective I've addressed the review bot comments, entirely isolated these changes so they only impact the Chat Completions path, expanded the existing test_harmony_utils.py test suite so that it tests both the Chat Completions and Responses paths with a common set of tests (moving logic specific to each into its own test class), and re-run my BFCL evaluation on gpt-oss-120b with the latest changes to ensure the numbers are approximately equal to before (approximately, because there is a bit of run-to-run variance in this test suite: even though the inputs are identical in every run, the outputs are not entirely deterministic).

I'll keep an eye out for any other review feedback or comments required to get this in.

@ehfd commented Nov 16, 2025

While I agree that tool calling and various other quirks in GPT-OSS had already improved from basically unusable to a usable state even before this PR, this should be designated as a milestone for v0.11.1.

@heheda12345 (Collaborator)

@chaunceyjiang (Collaborator)

/cc @yeqcharlotte PTAL.

@alecsolder (Contributor) left a comment


This is great! I think we're finally landing on an implementation that takes the learnings from all of us over the past few months :)

for item in content
if isinstance(item, dict) and item.get("type") == "text"
)
if content:
Contributor

what is this case for? If tool_calls is populated, I can understand reasoning being populated, but not this (which would imply it is more like a "final" message)

Contributor Author

I think you're right that this content more closely matches the final content, but now you've made me double-check how we generate the outputs, and we have some issues in the mapping of reasoning and tool call outputs when generating Chat Completion responses. In my head, the content here matched commentary channel responses from tool calls while the reasoning key matched analysis channel responses. However, that's not necessarily the case, and it looks like there are some edge cases in our reasoning and tool call parsers for gpt-oss models that can sometimes end up with this, sometimes end up with analysis or commentary messages missed entirely, or most often end up with analysis messages as reasoning and final messages as the tool call content.

In practice, this still seems to work quite well as-is. But we can likely pick up a bit more accuracy by doing a once-over of the Chat Completions reasoning/tool parsers and the chat_completion_*_generator paths for harmony models to ensure the streaming and non-streaming cases are all aligned with exactly how we parse input. I probably need a new test suite here that starts with Harmony messages, goes through our response parsing to Chat Completion responses, sends those back in as Chat Completion requests, and ensures the Harmony messages generated from that match what we originally sent.

Contributor Author

Part of what worries me here is the response generation side, where it decides which messages to map to reasoning vs content based solely on the order of the output messages, without checking any channels:

if len(output_msgs) == 0:
    # The generation has stopped during reasoning.
    reasoning = parser.current_content
    final_content = None
elif len(output_msgs) == 1:
    # The generation has stopped during final message.
    reasoning = output_msgs[0].content[0].text
    final_content = parser.current_content
else:
    reasoning_msg = output_msgs[:-1]
    final_msg = output_msgs[-1]
    reasoning = "\n".join([msg.content[0].text for msg in reasoning_msg])
    final_content = final_msg.content[0].text

Then, if using a tool call parser, we overwrite the content from that at:

tool_call_info = tool_parser.extract_tool_calls(
    "",
    request=request,
    token_ids=token_ids,  # type: ignore
)
content = tool_call_info.content
message = ChatMessage(
    role=role,
    reasoning=reasoning,
    content=content,
    tool_calls=tool_call_info.tool_calls,
)

The tool call parser at https://github.com/vllm-project/vllm/blob/63fed5550609b96b578d2512aefced09efe76e1e/vllm/entrypoints/openai/tool_parsers/openai_tool_parser.py only looks at Harmony messages with function recipients or messages to the final channel.

So, we may end up mixing analysis and commentary channel messages in reasoning or even dropping one of those entirely when making the Chat Completion response. To get this request mapping to Harmony messages exactly right, I'll need to clean up and pair this with some adjustments to the Harmony message to response mapping path so I can do this deterministically.

That will take a bit of time to match up, add some round-trip tests to ensure it doesn't deviate in the future, and re-run the eval. I can tackle this next week, either as part of this PR or as a follow-up, based on timing preferences.
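
For illustration, a channel-aware split could look roughly like the following sketch (this is not the PR's parse_chat_output; it only assumes each Harmony output message exposes .channel and .content[0].text, as in the excerpts above):

def split_output_by_channel(output_msgs):
    # Sketch: classify Harmony output messages by channel instead of by position.
    reasoning_parts: list[str] = []
    final_parts: list[str] = []
    for msg in output_msgs:
        text = msg.content[0].text
        if msg.channel == "analysis":
            # Raw chain of thought becomes the Chat Completion reasoning.
            reasoning_parts.append(text)
        elif msg.channel == "final":
            # User-visible output becomes the Chat Completion content.
            final_parts.append(text)
        # Commentary messages addressed to functions.* recipients are left to
        # the tool call parser rather than being folded into either field.
    reasoning = "\n".join(reasoning_parts) or None
    final_content = "\n".join(final_parts) or None
    return reasoning, final_content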

Contributor Author

So, there were some bugs on the Chat Completion response generation side around what I outlined above. I vastly expanded the new tests in test_serving_chat.py to verify various round-trips of Chat Completion requests to Harmony messages and Harmony outputs to Chat Completion responses including multi-turn variants that then send those Chat Completion response messages back in to future Chat Completion requests.

analysis_msg = Message.from_role_and_content(
    Role.ASSISTANT, reasoning_content
)
analysis_msg = analysis_msg.with_channel("analysis")
Contributor

This runs counter to how I feel it should work, but I think it is correct.

Trying to understand OpenAI's implementation: when reasoning is output to the commentary channel, that is meant to signal that this reasoning should be shown to the user as-is (rather than needing a summarization), i.e., they want to provide the actual reasoning for the tool call so the user can approve it or not, rather than only a summary. But when it is provided as input again, that distinction no longer matters, so they just always change it back to analysis so that it can be dropped.

msg = Message.from_role_and_content(Role.ASSISTANT, arguments)
msg = msg.with_channel("commentary")
msg = msg.with_recipient(f"functions.{name}")
msg = msg.with_content_type("json")
Contributor

I think this could use a comment since it differs from the OpenAI implementation; however, there has been enough proof at this point that it increases benchmark scores, so it should be kept.

Contributor Author

I added a comment to this effect, but also indicating it may need further evaluation.

if not msg.channel:
    msg = msg.with_channel("final")
msgs.append(msg)
# For user/system/developer messages, add them directly even if no content.
Contributor

I think system and developer cases here need more complex logic.

I think from a high level:

  • We should never have more than 1 system message
  • We should never have more than 1 developer message

(Can this validation be added to a test?)

Since we create a system message and a developer message every time in _make_request_with_harmony, I'm worried about what happens with system messages here.

Since there is no "instructions" field like in Responses, we need to merge the developer message created with the included tools with the developer message here, without duplicating the tools, for example. Complicated, and definitely worth the tests I think.
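
For illustration, that invariant could be checked with something like this sketch (it assumes openai_harmony messages expose author.role; how the Harmony messages are produced is left abstract and is not tied to any particular helper in this PR):

from collections import Counter

from openai_harmony import Role


def assert_single_system_and_developer(harmony_msgs) -> None:
    # Sketch of the suggested validation: a rendered prompt should contain at
    # most one system message and at most one developer message.
    roles = Counter(msg.author.role for msg in harmony_msgs)
    assert roles[Role.SYSTEM] <= 1, "expected at most one system message"
    assert roles[Role.DEVELOPER] <= 1, "expected at most one developer message"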

Contributor Author

I have not addressed this yet. This behavior should be unchanged from how it was before my PR, and I'm tempted to defer addressing this specific system/developer concern as this one is already large.

@bbRLdev commented Nov 19, 2025

Some >30% of the time, the model stops generating right after "reasoning_content"/"reasoning" is finished outputting, with stop_reason 200012 (200012 corresponds to the <|call|> token). No final answer in the "content" field is generated. This behavior is incompatible with most clients, e.g. Cline. Out of curiosity, will this PR fix said issue?

This is prevalent in vLLM 0.11.1 (from my observation, 0.11.1 may have made it occur more frequently now) and 0.11.0.

@ehfd commented Nov 19, 2025

> Some >30% of the time, the model stops generating right after "reasoning_content"/"reasoning" is finished outputting, with stop_reason 200012 (200012 corresponds to the <|call|> token). No final answer in the "content" field is generated. This behavior is incompatible with most clients, e.g. Cline. Out of curiosity, will this PR fix said issue?
>
> This is prevalent in vLLM 0.11.1 (from my observation, 0.11.1 may have made it occur more frequently now) and 0.11.0.

@bbRLdev Yes, I think it is plausible that the discussions and fixes here would help. It seems that very few self-hosted LLM runtimes have actually mastered OpenAI's Harmony format correctly yet.

@bbrowning (Contributor, Author)

Some more manual testing and close examination of the exact Harmony inputs and outputs from a few failed BFCL multi_turn test cases helped me squash some additional small bugs in how we converted Harmony outputs to Chat Completion responses, specifically around some invalid assumptions in the parse_chat_output method about which messages are reasoning and which are not. I've fixed those and added lots of unit tests. In particular, there are now unit test suites for both the streaming and non-streaming paths of serving_chat.py, where we test conversion of Chat Completion requests to Harmony messages, provide some mocked Harmony output and test how that gets converted to Chat Completion responses, and then send those response messages back in as inputs through multiple turns of tool calling and reasoning.

I feel this is in a good place and will kick off another full BFCL multi_turn run with gpt-oss-120b, based on the latest commit here, on two different generations of hardware (L40S and DGX Spark) to ensure everything looks good. I've run various individual tests while working, so I don't expect any regressions there.

@bbrowning (Contributor, Author)

> Some >30% of the time, the model stops generating right after "reasoning_content"/"reasoning" is finished outputting, with stop_reason 200012 (200012 corresponds to the <|call|> token). No final answer in the "content" field is generated. This behavior is incompatible with most clients, e.g. Cline. Out of curiosity, will this PR fix said issue?
>
> This is prevalent in vLLM 0.11.1 (from my observation, 0.11.1 may have made it occur more frequently now) and 0.11.0.

So, it is expected for generation to stop after the <|call|> token because that is a stop token. However, that stop token only gets emitted after a tool call and should not be happening immediately after reasoning. You won't necessarily get "content" with a tool call though, so it would be normal and expected to just get reasoning output and the actual tool call and then for generation to stop. The client is then expected to execute the tool call and provide the tool's output as a message from the tool role in a subsequent turn. I expect Cline knows how to do all this as it's standard Chat Completions stuff, but I don't personally use Cline.

If you are using the Chat Completions API (as opposed to Responses API) to query gpt-oss models, then this could fix your issue. I'm happy to test directly against what you're seeing if you have a semi-reliable way to reproduce it. Or, I encourage you to test against this PR if you're able to see how it impacts what you're seeing.
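
For reference, the standard round trip looks roughly like this sketch with the openai Python client (the endpoint, the get_weather tool, and the canned tool output are illustrative assumptions, not part of this PR):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake")
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b", messages=messages, tools=tools
)
msg = resp.choices[0].message
if msg.tool_calls:
    # Echo the assistant turn back (tool calls plus any reasoning/content), ...
    messages.append(msg.model_dump(exclude_none=True))
    # ... then provide the tool's output as a `tool` role message keyed by id.
    messages.append({
        "role": "tool",
        "tool_call_id": msg.tool_calls[0].id,
        "content": '{"temp_c": 21, "condition": "sunny"}',
    })
    final = client.chat.completions.create(
        model="openai/gpt-oss-120b", messages=messages, tools=tools
    )
    print(final.choices[0].message.content)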

@bbrowning (Contributor, Author)

I reran the same BFCL multi_turn suite on the same machine I reported these results on earlier (4xL40S) and things are holding steady. This latest round of fixes didn't meaningfully improve the results in that particular eval suite, but that suite doesn't use streaming requests, so the parity now achieved between the non-streaming and streaming paths in how they convert to and from Chat Completion and Harmony messages would not be reflected there. It does give me confidence that the latest changes don't regress anything in multi_turn scenarios compared to my previously tested changes.

@bbRLdev commented Nov 21, 2025

> Some >30% of the time, the model stops generating right after "reasoning_content"/"reasoning" is finished outputting, with stop_reason 200012 (200012 corresponds to the <|call|> token). No final answer in the "content" field is generated. This behavior is incompatible with most clients, e.g. Cline. Out of curiosity, will this PR fix said issue?
> This is prevalent in vLLM 0.11.1 (from my observation, 0.11.1 may have made it occur more frequently now) and 0.11.0.

> So, it is expected for generation to stop after the <|call|> token because that is a stop token. However, that stop token only gets emitted after a tool call and should not be happening immediately after reasoning. You won't necessarily get "content" with a tool call though, so it would be normal and expected to just get reasoning output and the actual tool call and then for generation to stop. The client is then expected to execute the tool call and provide the tool's output as a message from the tool role in a subsequent turn. I expect Cline knows how to do all this as it's standard Chat Completions stuff, but I don't personally use Cline.

@bbrowning So I was able to pull this change and test it myself with GPT OSS 120b + Cline. The model no longer skips the final answer and is not generating tools right after reasoning. This PR took GPT OSS 120b from being nearly unusable with Cline to being very much usable. So thank you for your work.

> If you are using the Chat Completions API (as opposed to Responses API) to query gpt-oss models, then this could fix your issue. I'm happy to test directly against what you're seeing if you have a semi-reliable way to reproduce it. Or, I encourage you to test against this PR if you're able to see how it impacts what you're seeing.

If you copy and paste the prompt Cline produces for your initial task and feed it directly to vLLM v0.11.1 in a simple request body only containing model and messages, you can reproduce it.

@fabienric commented Nov 26, 2025

Hi @bbrowning, any idea when your PR and #29236 can be merged? I tried your PR (great work, btw) but I still reproduce the issue with this test payload (which was shared here: #22519 (comment)):

{
  "max_completion_tokens": 2048,
  "messages": [
    {
      "content": "\n        # ROLE\n        You are a retrieval query optimizer for a Retrieval-Augmented Generation (RAG) system.\n        This query rewriting takes place in the context of a \"Panoramic\" RAG system, so you must take into account the project context to understand the user's request.\n\n        # OBJECTIVE\n        Convert the user’s request into a concise, keyword-rich, and context-preserving query that will maximize the relevance of retrieved documents.\n\n        # REALITY FILTER\n        - Do not add facts not given by the user\n        - Do not speculate about details that are not in the query\n\n        # INSTRUCTIONS\n        1. Identify the core informational target of the user’s query.\n        2. Identify the project context of the user's query.\n        3. Remove polite filler, vague phrasing, or irrelevant context.\n        4. Preserve important entities, dates, technical terms, and constraints.\n        5. If useful, rephrase into a canonical form that matches how data might be stored.\n        6. Output **only** the optimized retrieval query, with no commentary.\n\n        # PROJECT CONTEXT\n        \n        ",
      "role": "system"
    },
    {
      "content": "\n        # CONVERSATION\n        \n\n        # OUTPUT\n        User: \"What did the president of cor say ?\"\n        Assistant: \n        ",
      "role": "user"
    }
  ],
  "model": "gpt-oss-120b",
  "stream": false,
  "temperature": 0.1
}

@iSevenDays

@bbrowning thanks for the MR! Is there a way to make it work in codex? Currently, it stops after the first tool execution.
Cline seems to have issues initially (Invalid API Response), but after 2-3 requests it continues to work without issues, which is great.

My config.toml for codex includes a vllm profile:

[profiles.vllm]
model_provider = "vllm"
model = "openai/gpt-oss-120b"

[model_providers.vllm]
name = "vllm"
base_url = "http://192.168.0.24:8080/v1"
api_key = "sk-no-key-required"
timeout = 300
model_reasoning_effort = "high"
model_context_window = 131072
model = "openai/gpt-oss-120b"
wire_api = "chat"

@iSevenDays

I found one bug in vllm

curl -s -X POST http://192.168.0.24:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 50}' | jq .

"message": {
  "role": "assistant",
  "content": null,   // ← NULL, not empty string!
  "tool_calls": [],
  "reasoning": "The user says \"Say hello\"..."
}

the content was in reasoning and codex doesn't handle that

@xxsLuna commented Nov 28, 2025

vLLM_serve-1  | (APIServer pid=1) INFO:     192.168.100.17:64030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281] Error in chat completion stream generator.
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281] Traceback (most recent call last):
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281]   File "/root/vllm_package/vllm/vllm/entrypoints/openai/serving_chat.py", line 725, in chat_completion_stream_generator
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281]     harmony_parser.process(token_id)
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281]   File "/usr/local/lib/python3.12/dist-packages/openai_harmony/__init__.py", line 637, in process
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281]     self._inner.process(token)
vLLM_serve-1  | (APIServer pid=1) ERROR 11-27 21:29:25 [serving_chat.py:1281] openai_harmony.HarmonyError: unexpected tokens remaining in message header: Some("need to get coordinates for Moscow, Shanghai, Sapporo, Los Angeles. Use get_direct_geocoding. For Moscow, maybe \"Moscow, Russia\". It failed due to 'ko'? maybe language? Let's try \"Moscow, Russia\" again.<|end|><|start|>assistant<|channel|>commentary")

Thanks for your previous response. I followed the link you provided, switched to the chat-harmony-tests branch, built it, and tried the process again, but the same error as before is still occurring.
Is there anything else I can test on my side?

@bbrowning (Contributor, Author)

> I found one bug in vllm
>
> curl -s -X POST http://192.168.0.24:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 50}' | jq .
>
> "message": {
>   "role": "assistant",
>   "content": null,   // ← NULL, not empty string!
>   "tool_calls": [],
>   "reasoning": "The user says \"Say hello\"..."
> }
>
> the content was in reasoning and codex doesn't handle that

This isn't a bug though, is it? The model generated reasoning text, but hit its max_tokens limit, so it had to stop before it had a chance to generate the user-visible content. So, all you got as output was reasoning text. I see the same locally, but this would be expected behavior because generation stopped before any non-reasoning content was ever generated.

If you'd like to see the content as well, you'll need to raise max_tokens to a value high enough that allows the model to generate both reasoning and content. In your simple example, raising this from 50 to 500 worked for me.
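
For example, a minimal sketch with the openai Python client (endpoint and token budget are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://192.168.0.24:8080/v1", api_key="sk-no-key-required")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=500,  # large enough to cover both the reasoning and the final answer
)
print(resp.choices[0].message.content)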

Instead of doing my own workaround for Pydantic's poor handling of
Iterable fields, re-use the existing maybe_serialize_tool_calls method
that the MistralTokenizer work added to turn these ValidatorIterator
instances into actual lists.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
This isolates the fixes made for turning Chat Completion Requests into
Harmony Messages from the code path that Responses uses for
previous_input_messages so that Chat Completions support can be iterated
and improved without worrying about regressing Responses previous input
handling.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
This addresses a few minor points of review feedback from my larger set
of fixes for Chat Completions and gpt-oss models:

- In `auto_drop_analysis_messages`, drop any analysis messages before
  the latest final message instead of just those from the assistant.
- Cleaned up `parse_chat_output` so that it checks the message channel
  before assigning messages as reasoning or final content, instead of
  relying solely on the order of messages and incorrect assumptions
  about the number of messages present in each response.
- Some misc. typo fixes, extracting copy/pasted code into methods, etc.

This also adds many tests, as I found and fixed issues raised during
review feedback and manual testing. There are now round-trip tests for
converting Chat Completion requests to Harmony messages and for
converting Harmony output to Chat Completion responses, all with mocked
generation so they're part of the unit test suite. These run in both
streaming and non-streaming mode to ensure parity there.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning (Contributor, Author) commented Dec 3, 2025

I rebased this to fix conflicts and adapt to the processor -> input_processor change in serving_engine.py. All the tests continue to pass and this still meaningfully improves gpt-oss model results on multi-turn benchmarks.

From my perspective, this is ready to merge and provides much needed fixes to running these models via Chat Completions.

@chaunceyjiang added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Dec 4, 2025
@mergify bot commented Dec 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bbrowning.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Dec 5, 2025

Labels

frontend, gpt-oss (Related to GPT-OSS models), needs-rebase, ready (ONLY add when PR is ready to merge/full CI is needed), tool-calling
