
Conversation

@lizexu123 (Collaborator) commented Nov 30, 2025

Motivation

Support reward models.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Support reward models.

Usage or Command

Server launch:

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-num-seqs 256 \
    --max-model-len 32768 \
    --port 13351 \
    --engine-worker-queue-port 7562 \
    --metrics-port 7531 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --quantization "wint8" \
    --runner pooling \
    --convert embed

Request:

curl --location 'http://0.0.0.0:13351/v1/reward' \
--header 'Content-Type: application/json' \
--data-raw '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "北京天安门在哪里?"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "北京天安门在中国北京故宫的前面。"
                }
            ]
        }
    ],
    "user": "user-123"
}'
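The same request from Python, with the response read back (a sketch: the exact response schema is not shown in this PR, so the final line's expectation is an assumption based on OpenAI-style pooling responses):

import requests

# Mirrors the curl request above; response handling is hypothetical.
payload = {
    "model": "default",
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "北京天安门在哪里?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "北京天安门在中国北京故宫的前面。"}]},
    ],
    "user": "user-123",
}
resp = requests.post("http://0.0.0.0:13351/v1/reward", json=payload)
print(resp.json())  # expected to carry the pooled reward score(s)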

This PR only supports single-batch inference.
TODO: support multi-batch inference for embedding models.

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot (bot) commented Nov 30, 2025

Thanks for your contribution!

@codecov-commenter commented Nov 30, 2025

Codecov Report

❌ Patch coverage is 77.77778% with 12 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@68533eb). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| fastdeploy/model_executor/layers/pooler.py | 66.66% | 3 Missing ⚠️ |
| fastdeploy/model_executor/layers/pool/metadata.py | 71.42% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/models/ernie_vl_rm.py | 66.66% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/model_executor/pre_and_post_process.py | 0.00% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 84.61% | 1 Missing and 1 partial ⚠️ |
| fastdeploy/engine/pooling_params.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5301   +/-   ##
==========================================
  Coverage           ?   59.75%           
==========================================
  Files              ?      324           
  Lines              ?    39721           
  Branches           ?     5979           
==========================================
  Hits               ?    23736           
  Misses             ?    14111           
  Partials           ?     1874           
| Flag | Coverage Δ |
|------|------------|
| GPU | 59.75% <77.77%> (?) |

truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None

# --8<-- [start:chat-embedding-extra-params]
add_generation_prompt: bool = Field(
Collaborator: Looking at older code, some similar debug annotations were merged in as well; we can clean them up together later.

|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ ||

## Offline Inference
Collaborator: Should this be "Online"?

Collaborator (Author): done


#### Predefined models

If the [Pooler][fastdeploy.model_executor.layers.pooler.Pooler] defined by the model accepts pooler_config, you can override some of its attributes via --pooler_config.
Collaborator: Is this link not rendering?

Collaborator (Author): Fixed.
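For illustration, such an override might look like the following (hypothetical: the JSON-string format for --pooler_config is assumed, not confirmed by this PR):

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --runner pooling \
    --convert embed \
    --pooler_config '{"pooling_type": "LAST", "normalize": false}'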

|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ ||

When loading a `Sentence Transformers` model, its modules.json configuration takes precedence over the defaults; the pooling type can also be specified at model-definition time via @default_pooling_type("LAST").
Collaborator: The English docs don't seem to include this; please cross-check the two documents against each other.

Collaborator (Author): Fixed.
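As a minimal sketch of the decorator mentioned above (the import path and model base class are assumptions for illustration):

import paddle.nn as nn

# Hypothetical import path; the decorator name comes from the docs above.
from fastdeploy.model_executor.layers.pooler import default_pooling_type

@default_pooling_type("LAST")  # sets the model's default pooling type at definition time
class MyEmbeddingModel(nn.Layer):
    ...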

Copilot AI (Contributor) left a comment:

Pull request overview

This PR adds support for reward models to FastDeploy, extending the pooling model framework to handle reward scoring in addition to embeddings. The implementation introduces a new reward API endpoint and includes comprehensive test coverage with baseline comparison for accuracy validation.

Key Changes

  • Added reward model category to the model system with proper pooling support
  • Implemented reward-specific pooling logic with prefix caching support
  • Created new /v1/reward API endpoint for reward scoring requests
  • Added comprehensive documentation in both English and Chinese

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 22 comments.

Summary per file:

| File | Description |
|------|-------------|
| tests/pooling/test_Qwen3-Embedding_serving.py | Changed from pytest.skip to FileNotFoundError for missing models |
| tests/pooling/test_Ernie4_5_reward_serving.py | New test suite for reward model serving with caching/non-caching scenarios |
| fastdeploy/worker/gpu_model_runner.py | Added pooling model handling for max_tokens and prefix caching support |
| fastdeploy/model_executor/pre_and_post_process.py | Added null check for pooler outputs in stream transfer |
| fastdeploy/model_executor/models/model_base.py | Extended is_pooling flag to include reward models |
| fastdeploy/model_executor/models/ernie_vl_rm.py | Implemented reward model with LAST pooling and proper pooler dispatching |
| fastdeploy/model_executor/models/ernie4_5_vl/ernie4_5_vl_moe.py | Added float32 norm dtype support for reward models |
| fastdeploy/model_executor/models/adapters.py | Added reward pooler to adapter initialization |
| fastdeploy/model_executor/layers/pooler.py | Added Pooler.for_reward factory method and LastPool support for reward task |
| fastdeploy/model_executor/layers/pool/metadata.py | Improved device type handling with proper type hints |
| fastdeploy/entrypoints/openai/serving_engine.py | Removed add_generation_prompt from chat template kwargs |
| fastdeploy/entrypoints/openai/protocol.py | Cleaned up reward request/response protocol definitions |
| fastdeploy/engine/request.py | Disabled thinking mode when pooling_params is present |
| fastdeploy/engine/pooling_params.py | Changed default normalize to False for reward task |
| fastdeploy/config.py | Fixed num_hidden_layers override to skip pooling runner |
| docs/zh/features/pooling_models.md | Added Chinese documentation for pooling and reward models |
| docs/features/pooling_models.md | Added English documentation for pooling and reward models |


Comment on lines 2409 to 2414
    output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None
    pooler_output.append(output)
else:
    current_seq_len_decoder = seq_lens_decoder_batch[i]
    if int(current_seq_len_decoder) + int(seq_len) == int(prompt_len):
        output = raw_pooler_output[0].data
Copilot AI (Dec 1, 2025):

The logic always accesses raw_pooler_output[0] regardless of the loop index i. This appears incorrect - it should access raw_pooler_output[i] to get the output corresponding to the current request being processed.

Consider changing:

output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None

to:

output = raw_pooler_output[i].data if int(seq_len) == int(prompt_len) else None

And similarly on line 2414.

Suggested change:

    output = raw_pooler_output[i].data if int(seq_len) == int(prompt_len) else None
    pooler_output.append(output)
else:
    current_seq_len_decoder = seq_lens_decoder_batch[i]
    if int(current_seq_len_decoder) + int(seq_len) == int(prompt_len):
        output = raw_pooler_output[i].data

Comment on lines +57 to +61
3.CLSPool(PoolingType.CLS)

Purpose: returns the hidden state of each sequence's first token (the CLS token)

4.MeanPool(PoolingType.MEAN)
Copilot AI (Dec 1, 2025):

Inconsistent formatting: items 1 and 2 are numbered "1." and "2." with a space after the period, but "3.CLSPool" and "4.MeanPool" lack that space.

Suggested change:

3. CLSPool(PoolingType.CLS)
   Purpose: returns the hidden state of each sequence's first token (the CLS token)
4. MeanPool(PoolingType.MEAN)

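For intuition, a standalone sketch (not FastDeploy's implementation) of what these pooling types compute over one sequence's hidden states:

import paddle

def pool_one(hidden_states: paddle.Tensor, pooling_type: str) -> paddle.Tensor:
    # hidden_states: [seq_len, hidden_size] for a single sequence
    if pooling_type == "LAST":  # LastPool: hidden state of the final token
        return hidden_states[-1]
    if pooling_type == "CLS":  # CLSPool: hidden state of the first (CLS) token
        return hidden_states[0]
    if pooling_type == "MEAN":  # MeanPool: average over all tokens
        return hidden_states.mean(axis=0)
    raise ValueError(f"unknown pooling type: {pooling_type}")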
Comment on lines +67 to +70
FastDeploy的OpenAI兼容服务器提供了API的端点和自定义的reward接口

- `Embeddings API`,支持文本和多模态输入
- `Reward API`,给指定的内容打分
Copilot AI (Dec 1, 2025):

Grammar and formatting issue: "API的端点" should be "API端点" (redundant 的). Also, the list should use proper bullet points format with dashes, and there should be proper spacing after colons in Chinese text.

Comment on lines +127 to +132
    _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")


@pytest.fixture(scope="function")
def server_no_caching():
    _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
Copilot AI (Dec 1, 2025):

The fixture server_default_caching starts the server but doesn't yield the process or perform cleanup. The server process should be yielded and properly terminated in a teardown section to avoid leaving orphaned processes.

Consider:

@pytest.fixture(scope="function")
def server_default_caching():
    process = _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
    yield process
    try:
        os.killpg(process.pid, signal.SIGTERM)
    except Exception:
        pass
    clean_ports()
Suggested change:

@pytest.fixture(scope="function")
def server_default_caching():
    process = _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
    try:
        yield process
    finally:
        try:
            os.killpg(process.pid, signal.SIGTERM)
        except Exception:
            pass
        clean_ports()


@pytest.fixture(scope="function")
def server_no_caching():
    process = _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
    try:
        yield process
    finally:
        try:
            os.killpg(process.pid, signal.SIGTERM)
        except Exception:
            pass
        clean_ports()

def forward(
    self,
    hidden_states: Union[paddle.Tensor, list[paddle.Tensor]],  # before
    hidden_states: paddle.Tensor | list[paddle.Tensor],  # after
Copilot AI (Dec 1, 2025):

The type annotation uses the newer | syntax (paddle.Tensor | list[paddle.Tensor]), but earlier in the same file (e.g., line 435) uses the older Union syntax. This is inconsistent. Consider using Union[paddle.Tensor, list[paddle.Tensor]] for consistency with the rest of the codebase.

Suggested change:

    hidden_states: Union[paddle.Tensor, list[paddle.Tensor]],

Comment on lines +28 to +29
!!! 提示<br>
You can explicitly set `--convert <type>` to specify how the model is converted.
Copilot AI (Dec 1, 2025):

The admonition syntax is incorrect: it should be `!!! 提示` with no trailing `<br>` tag.

Comment on lines +15 to +16
!!! 提示<br>
在绝大多数情况下无需手动设置该选项,因此Fastdeploy可以通过--runner auto(默认值)自动检测合适的runner。
Copilot AI (Dec 1, 2025):

The admonition syntax is incorrect (should be !!! 提示 without <br>), and "Fastdeploy" should be "FastDeploy" for consistent capitalization. Also, there's an extra comma before "因为".

Suggested change:

!!! 提示
    在绝大多数情况下无需手动设置该选项,因此FastDeploy可以通过--runner auto(默认值)自动检测合适的runner。

Copilot uses AI. Check for mistakes.

| Architecture | `--convert` | 支持的池化类型 |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` `**ForProcessRewardModel` | `embed` | `embed` |
Copilot AI (Dec 1, 2025):

Missing comma between *Model and **ForProcessRewardModel. It should be a comma-separated list. Also, **ForProcessRewardModel should be *ForProcessRewardModel (single asterisk for consistency).

Suggested change:

| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

Comment on lines 256 to 258
chat_template_kwargs.update(
    {
        "chat_template": request_dict.get("chat_template"),
Copilot AI (Dec 1, 2025):

The add_generation_prompt parameter has been removed from the chat_template_kwargs dictionary. However, this change doesn't check if the parameter is still needed elsewhere or provide any fallback. If this parameter was being used by chat templates, this could break existing functionality. Please ensure that removing this parameter doesn't break any existing chat template implementations or add a comment explaining why it's safe to remove.

print("[Server Setup] Server failed to start. Cleaning up...")
try:
os.killpg(process.pid, signal.SIGTERM)
except Exception:
Copilot AI (Dec 1, 2025):

'except' clause does nothing but pass and there is no explanatory comment.

if pooling_params is not None:
    self.enable_thinking = False
else:
    self.enable_thinking = enable_thinking
Collaborator: This logic feels very model-specific. I'd suggest solving it by changing the chat template in the model rather than hardcoding it here.

enable_thinking = d.get("enable_thinking", None)

if pooling_params is not None:
    enable_thinking = False
Collaborator: Same as above.

@Jiang-Jia-Jun (Collaborator) commented Dec 1, 2025:

There are still some open questions about how enable_thinking is defined: not every model uses enable_thinking as the chat-template variable that toggles thinking on and off. In newer versions it is passed in through chat_template_kwargs, and this may change later.
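For example, under that newer convention the switch travels in the request body instead of being hardcoded (a sketch; whether a given model's chat template actually reads enable_thinking is model-specific, not guaranteed by this PR):

payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
# e.g. requests.post("http://0.0.0.0:13351/v1/chat/completions", json=payload)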

if self.is_pooling_model:
    rope_3d_position_ids["max_tokens_lst"].append(0)
else:
    rope_3d_position_ids["max_tokens_lst"].append(request.get("max_tokens", 2048))
Collaborator: Can a request ever be missing max_tokens? Writing a default value here worries me; it could make problems harder to diagnose later.
