
Conversation

@lizexu123 (Collaborator) commented Nov 30, 2025

Motivation

Support reward models.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Support reward models.

Usage or Command

Server launch:

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-num-seqs 256 \
    --max-model-len 32768 \
    --port 13351 \
    --engine-worker-queue-port 7562 \
    --metrics-port 7531 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --quantization "wint8" \
    --runner pooling \
    --convert embed

Request:

curl --location 'http://0.0.0.0:13351/v1/reward' \
--header 'Content-Type: application/json' \
--data-raw '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "北京天安门在哪里?"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "北京天安门在中国北京故宫的前面。"
                }
            ]
        }
    ],
    "user": "user-123"
}'
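The same request from Python, with the response read back (a sketch: the exact response schema is not shown in this PR, so the final line's expectation is an assumption based on OpenAI-style pooling responses):

import requests

# Mirrors the curl request above; response handling is hypothetical.
payload = {
    "model": "default",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "北京天安门在哪里?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "北京天安门在中国北京故宫的前面。"}]},
    ],
    "user": "user-123",
}
resp = requests.post("http://0.0.0.0:13351/v1/reward", json=payload)
print(resp.json())  # expected to carry the pooled reward score(s)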

This PR only supports single-batch inference.
TODO: support multi-batch inference for embedding models.

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot (bot) commented Nov 30, 2025

Thanks for your contribution!

@codecov-commenter commented Nov 30, 2025

Codecov Report

❌ Patch coverage is 77.77778% with 12 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@68533eb). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| fastdeploy/model_executor/layers/pooler.py | 66.66% | 3 Missing ⚠️ |
| fastdeploy/model_executor/layers/pool/metadata.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/models/ernie_vl_rm.py | 66.66% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/pre_and_post_process.py | 0.00% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 84.61% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/engine/pooling_params.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5301   +/-   ##
==========================================
  Coverage           ?   59.75%           
==========================================
  Files              ?      324           
  Lines              ?    39721           
  Branches           ?     5979           
==========================================
  Hits               ?    23736           
  Misses             ?    14111           
  Partials           ?     1874           
| Flag | Coverage Δ |
|------|------------|
| GPU | 59.75% <77.77%> (?) |

truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None

# --8<-- [start:chat-embedding-extra-params]
add_generation_prompt: bool = Field(
Collaborator: Looking at older code, some similar debug annotations were merged in as well; we can clean them up together later.

|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ ||

## Offline Inference
Collaborator: Should this be "Online"?

Collaborator (Author): done


#### Predefined models

If the [Pooler][fastdeploy.model_executor.layers.pooler.Pooler] defined by the model accepts pooler_config, you can override some of its attributes via --pooler_config.
Collaborator: Is this link not rendering?

Collaborator (Author): Fixed.
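For illustration, such an override might look like the following (hypothetical: the JSON-string format for --pooler_config is assumed, not confirmed by this PR):

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --runner pooling \
    --convert embed \
    --pooler_config '{"pooling_type": "LAST", "normalize": false}'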

|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ ||

When loading a `Sentence Transformers` model, its modules.json configuration takes precedence over the defaults; the pooling type can also be specified at model-definition time via @default_pooling_type("LAST").
Collaborator: The English docs don't seem to include this; please cross-check the two documents against each other.

Collaborator (Author): Fixed.
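As a minimal sketch of the decorator mentioned above (the import path and model base class are assumptions for illustration):

import paddle.nn as nn

# Hypothetical import path; the decorator name comes from the docs above.
from fastdeploy.model_executor.layers.pooler import default_pooling_type

@default_pooling_type("LAST")  # sets the model's default pooling type at definition time
class MyEmbeddingModel(nn.Layer):
    ...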

Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds support for reward models to FastDeploy, extending the pooling model framework to handle reward scoring in addition to embeddings. The implementation introduces a new reward API endpoint and includes comprehensive test coverage with baseline comparison for accuracy validation.

Key Changes

  • Added reward model category to the model system with proper pooling support
  • Implemented reward-specific pooling logic with prefix caching support
  • Created new /v1/reward API endpoint for reward scoring requests
  • Added comprehensive documentation in both English and Chinese

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 22 comments.

Summary per file:

| File | Description |
|------|-------------|
| tests/pooling/test_Qwen3-Embedding_serving.py | Changed from pytest.skip to FileNotFoundError for missing models |
| tests/pooling/test_Ernie4_5_reward_serving.py | New test suite for reward model serving with caching/non-caching scenarios |
| fastdeploy/worker/gpu_model_runner.py | Added pooling model handling for max_tokens and prefix caching support |
| fastdeploy/model_executor/pre_and_post_process.py | Added null check for pooler outputs in stream transfer |
| fastdeploy/model_executor/models/model_base.py | Extended is_pooling flag to include reward models |
| fastdeploy/model_executor/models/ernie_vl_rm.py | Implemented reward model with LAST pooling and proper pooler dispatching |
| fastdeploy/model_executor/models/ernie4_5_vl/ernie4_5_vl_moe.py | Added float32 norm dtype support for reward models |
| fastdeploy/model_executor/models/adapters.py | Added reward pooler to adapter initialization |
| fastdeploy/model_executor/layers/pooler.py | Added Pooler.for_reward factory method and LastPool support for reward task |
| fastdeploy/model_executor/layers/pool/metadata.py | Improved device type handling with proper type hints |
| fastdeploy/entrypoints/openai/serving_engine.py | Removed add_generation_prompt from chat template kwargs |
| fastdeploy/entrypoints/openai/protocol.py | Cleaned up reward request/response protocol definitions |
| fastdeploy/engine/request.py | Disabled thinking mode when pooling_params is present |
| fastdeploy/engine/pooling_params.py | Changed default normalize to False for reward task |
| fastdeploy/config.py | Fixed num_hidden_layers override to skip pooling runner |
| docs/zh/features/pooling_models.md | Added Chinese documentation for pooling and reward models |
| docs/features/pooling_models.md | Added English documentation for pooling and reward models |


Comment on lines 2409 to 2414
    output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None
    pooler_output.append(output)
else:
    current_seq_len_decoder = seq_lens_decoder_batch[i]
    if int(current_seq_len_decoder) + int(seq_len) == int(prompt_len):
        output = raw_pooler_output[0].data
Copilot AI (Dec 1, 2025):

The logic always accesses raw_pooler_output[0] regardless of the loop index i. This appears incorrect - it should access raw_pooler_output[i] to get the output corresponding to the current request being processed.

Consider changing:

output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None

to:

output = raw_pooler_output[i].data if int(seq_len) == int(prompt_len) else None

And similarly on line 2414.

Suggested change:

    output = raw_pooler_output[i].data if int(seq_len) == int(prompt_len) else None
    pooler_output.append(output)
else:
    current_seq_len_decoder = seq_lens_decoder_batch[i]
    if int(current_seq_len_decoder) + int(seq_len) == int(prompt_len):
        output = raw_pooler_output[i].data

Comment on lines +57 to +61
3.CLSPool(PoolingType.CLS)

Purpose: returns the hidden state of each sequence's first token (the CLS token)

4.MeanPool(PoolingType.MEAN)
Copilot AI (Dec 1, 2025):

Inconsistent formatting: items 1 and 2 are numbered "1." and "2." with a space after the period, but "3.CLSPool" and "4.MeanPool" lack that space.

Suggested change:

3. CLSPool(PoolingType.CLS)
   Purpose: returns the hidden state of each sequence's first token (the CLS token)
4. MeanPool(PoolingType.MEAN)

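For intuition, a standalone sketch (not FastDeploy's implementation) of what these pooling types compute over one sequence's hidden states:

import paddle

def pool_one(hidden_states: paddle.Tensor, pooling_type: str) -> paddle.Tensor:
    # hidden_states: [seq_len, hidden_size] for a single sequence
    if pooling_type == "LAST":  # LastPool: hidden state of the final token
        return hidden_states[-1]
    if pooling_type == "CLS":  # CLSPool: hidden state of the first (CLS) token
        return hidden_states[0]
    if pooling_type == "MEAN":  # MeanPool: average over all tokens
        return hidden_states.mean(axis=0)
    raise ValueError(f"unknown pooling type: {pooling_type}")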
Comment on lines +67 to +70
FastDeploy的OpenAI兼容服务器提供了API的端点和自定义的reward接口

- `Embeddings API`,支持文本和多模态输入
- `Reward API`,给指定的内容打分
Copilot AI (Dec 1, 2025):

Grammar and formatting issue: "API的端点" should be "API端点" (redundant 的). Also, the list should use proper bullet points format with dashes, and there should be proper spacing after colons in Chinese text.

Comment on lines +127 to +132
    _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")


@pytest.fixture(scope="function")
def server_no_caching():
    _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
Copilot AI (Dec 1, 2025):

The fixture server_default_caching starts the server but doesn't yield the process or perform cleanup. The server process should be yielded and properly terminated in a teardown section to avoid leaving orphaned processes.

Consider:

@pytest.fixture(scope="function")
def server_default_caching():
    process = _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
    yield process
    try:
        os.killpg(process.pid, signal.SIGTERM)
    except Exception:
        pass
    clean_ports()
Suggested change:

@pytest.fixture(scope="function")
def server_default_caching():
    process = _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
    try:
        yield process
    finally:
        try:
            os.killpg(process.pid, signal.SIGTERM)
        except Exception:
            pass
        clean_ports()


@pytest.fixture(scope="function")
def server_no_caching():
    process = _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
    try:
        yield process
    finally:
        try:
            os.killpg(process.pid, signal.SIGTERM)
        except Exception:
            pass
        clean_ports()

def forward(
    self,
    hidden_states: Union[paddle.Tensor, list[paddle.Tensor]],  # before
    hidden_states: paddle.Tensor | list[paddle.Tensor],  # after
Copilot AI (Dec 1, 2025):

The type annotation uses the newer | syntax (paddle.Tensor | list[paddle.Tensor]), but earlier in the same file (e.g., line 435) uses the older Union syntax. This is inconsistent. Consider using Union[paddle.Tensor, list[paddle.Tensor]] for consistency with the rest of the codebase.

Suggested change:

    hidden_states: Union[paddle.Tensor, list[paddle.Tensor]],

Comment on lines +28 to +29
!!! 提示<br>
You can explicitly set `--convert <type>` to specify how the model is converted.
Copilot AI (Dec 1, 2025):

The admonition syntax is incorrect: it should be `!!! 提示` with no trailing `<br>` tag.

Comment on lines +15 to +16
!!! 提示<br>
在绝大多数情况下无需手动设置该选项,因此Fastdeploy可以通过--runner auto(默认值)自动检测合适的runner。
Copilot AI (Dec 1, 2025):

The admonition syntax is incorrect (should be !!! 提示 without <br>), and "Fastdeploy" should be "FastDeploy" for consistent capitalization. Also, there's an extra comma before "因为".

Suggested change:

!!! 提示
    在绝大多数情况下无需手动设置该选项,因此FastDeploy可以通过--runner auto(默认值)自动检测合适的runner。

Copilot uses AI. Check for mistakes.

| Architecture | `--convert` | 支持的池化类型 |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` `**ForProcessRewardModel` | `embed` | `embed` |
Copilot AI (Dec 1, 2025):

Missing comma between *Model and **ForProcessRewardModel. It should be a comma-separated list. Also, **ForProcessRewardModel should be *ForProcessRewardModel (single asterisk for consistency).

Suggested change:

| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

Comment on lines 256 to 258
chat_template_kwargs.update(
    {
        "chat_template": request_dict.get("chat_template"),
Copilot AI (Dec 1, 2025):

The add_generation_prompt parameter has been removed from the chat_template_kwargs dictionary. However, this change doesn't check if the parameter is still needed elsewhere or provide any fallback. If this parameter was being used by chat templates, this could break existing functionality. Please ensure that removing this parameter doesn't break any existing chat template implementations or add a comment explaining why it's safe to remove.

print("[Server Setup] Server failed to start. Cleaning up...")
try:
os.killpg(process.pid, signal.SIGTERM)
except Exception:
Copilot AI (Dec 1, 2025):

'except' clause does nothing but pass and there is no explanatory comment.

if pooling_params is not None:
    self.enable_thinking = False
else:
    self.enable_thinking = enable_thinking
Collaborator: This logic feels very model-specific. I'd suggest solving it by changing the chat template in the model rather than hardcoding it here.

enable_thinking = d.get("enable_thinking", None)

if pooling_params is not None:
    enable_thinking = False
Collaborator: Same as above.

@Jiang-Jia-Jun (Collaborator) commented Dec 1, 2025:

There are still some open questions about how enable_thinking is defined: not every model uses enable_thinking as the chat-template variable that toggles thinking on and off. In newer versions it is passed in through chat_template_kwargs, and this may change later.
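For example, under that newer convention the switch travels in the request body instead of being hardcoded (a sketch; whether a given model's chat template actually reads enable_thinking is model-specific, not guaranteed by this PR):

payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
# e.g. requests.post("http://0.0.0.0:13351/v1/chat/completions", json=payload)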

if self.is_pooling_model:
    rope_3d_position_ids["max_tokens_lst"].append(0)
else:
    rope_3d_position_ids["max_tokens_lst"].append(request.get("max_tokens", 2048))
Collaborator: Can a request ever be missing max_tokens? Writing a default value here worries me; it could make problems harder to diagnose later.
