[Feature] support reward model #5301
base: develop
Conversation
…into support_pooling_5
Thanks for your contribution!
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #5301   +/-   ##
==========================================
  Coverage         ?   59.75%
==========================================
  Files            ?      324
  Lines            ?    39721
  Branches         ?     5979
==========================================
  Hits             ?    23736
  Misses           ?    14111
  Partials         ?     1874
```
```python
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None

# --8<-- [start:chat-embedding-extra-params]
add_generation_prompt: bool = Field(
```
Previously merged code also contains some similar debug comment markers; we can clean those up together later.
docs/features/pooling_models.md
Outdated
```
|------------|--------------|---------------|---------|
| `embed`    | `LAST`       | ✅︎            | ❌      |

## Offline Inference
```
Should this be "Online"?
done
docs/zh/features/pooling_models.md
Outdated
```
#### Predefined models

If the model's [Pooler][fastdeploy.model_executor.layers.pooler.Pooler] accepts a pooler_config, you can override some of its attributes via --pooler_config.
```
Is this link not rendering?
Fixed.
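As an aside on the doc line above: it says attributes of a model-defined Pooler can be partially overridden via `--pooler_config`. A hypothetical sketch of building such an override; the accepted keys (`pooling_type`, `normalize`) are assumptions inferred from other parts of this PR, not a confirmed schema:

```python
import json

# Build the JSON string to pass via --pooler_config (keys are assumptions).
pooler_config = {"pooling_type": "LAST", "normalize": False}
print(json.dumps(pooler_config))  # e.g. '{"pooling_type": "LAST", "normalize": false}'
```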
docs/zh/features/pooling_models.md
Outdated
```
|------------|--------------|---------------|---------|
| `embed`    | `LAST`       | ✅︎            | ❌      |

When loading a `Sentence Transformers` model, its modules.json configuration takes precedence over the defaults; the pooling type can also be specified at model-construction time via @default_pooling_type("LAST").
```
The English doc doesn't seem to have this content; please cross-check the two documents.
Fixed.
Pull request overview
This PR adds support for reward models to FastDeploy, extending the pooling model framework to handle reward scoring in addition to embeddings. The implementation introduces a new reward API endpoint and includes comprehensive test coverage with baseline comparison for accuracy validation.
Key Changes
- Added a reward model category to the model system with proper pooling support
- Implemented reward-specific pooling logic with prefix caching support
- Created a new `/v1/reward` API endpoint for reward scoring requests
- Added comprehensive documentation in both English and Chinese
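To make the new endpoint concrete, a minimal request sketch. Only the `/v1/reward` path comes from this PR; the host/port and the payload field names (`model`, `messages`) are assumptions based on the OpenAI-style serving layer:

```python
import requests

# Hypothetical reward request: score an existing conversation.
resp = requests.post(
    "http://localhost:8188/v1/reward",  # port is a placeholder
    json={
        "model": "default",
        "messages": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4."},
        ],
    },
)
print(resp.json())  # expected to carry the reward score(s) for the given content
```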
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 22 comments.
| File | Description |
|---|---|
| tests/pooling/test_Qwen3-Embedding_serving.py | Changed from pytest.skip to FileNotFoundError for missing models |
| tests/pooling/test_Ernie4_5_reward_serving.py | New test suite for reward model serving with caching/non-caching scenarios |
| fastdeploy/worker/gpu_model_runner.py | Added pooling model handling for max_tokens and prefix caching support |
| fastdeploy/model_executor/pre_and_post_process.py | Added null check for pooler outputs in stream transfer |
| fastdeploy/model_executor/models/model_base.py | Extended is_pooling flag to include reward models |
| fastdeploy/model_executor/models/ernie_vl_rm.py | Implemented reward model with LAST pooling and proper pooler dispatching |
| fastdeploy/model_executor/models/ernie4_5_vl/ernie4_5_vl_moe.py | Added float32 norm dtype support for reward models |
| fastdeploy/model_executor/models/adapters.py | Added reward pooler to adapter initialization |
| fastdeploy/model_executor/layers/pooler.py | Added Pooler.for_reward factory method and LastPool support for reward task |
| fastdeploy/model_executor/layers/pool/metadata.py | Improved device type handling with proper type hints |
| fastdeploy/entrypoints/openai/serving_engine.py | Removed add_generation_prompt from chat template kwargs |
| fastdeploy/entrypoints/openai/protocol.py | Cleaned up reward request/response protocol definitions |
| fastdeploy/engine/request.py | Disabled thinking mode when pooling_params is present |
| fastdeploy/engine/pooling_params.py | Changed default normalize to False for reward task |
| fastdeploy/config.py | Fixed num_hidden_layers override to skip pooling runner |
| docs/zh/features/pooling_models.md | Added Chinese documentation for pooling and reward models |
| docs/features/pooling_models.md | Added English documentation for pooling and reward models |
```python
    output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None
    pooler_output.append(output)
else:
    current_seq_len_decoder = seq_lens_decoder_batch[i]
    if int(current_seq_len_decoder) + int(seq_len) == int(prompt_len):
        output = raw_pooler_output[0].data
```
Copilot AI (Dec 1, 2025)
The logic always accesses raw_pooler_output[0] regardless of the loop index i. This appears incorrect: it should access raw_pooler_output[i] to get the output corresponding to the current request being processed.

Consider changing:

```python
output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None
```

to:

```python
output = raw_pooler_output[i].data if int(seq_len) == int(prompt_len) else None
```

And similarly on line 2414.
Suggested change:

```diff
-    output = raw_pooler_output[0].data if int(seq_len) == int(prompt_len) else None
+    output = raw_pooler_output[i].data if int(seq_len) == int(prompt_len) else None
     pooler_output.append(output)
 else:
     current_seq_len_decoder = seq_lens_decoder_batch[i]
     if int(current_seq_len_decoder) + int(seq_len) == int(prompt_len):
-        output = raw_pooler_output[0].data
+        output = raw_pooler_output[i].data
```
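Worth connecting this to the PR description below: the author notes the PR only supports single-batch inference for now, so `raw_pooler_output[0]` and `raw_pooler_output[i]` coincide today; indexing by `i` is still the safer form ahead of the multi-batch TODO.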
```
3.CLSPool(PoolingType.CLS)

Purpose: returns the hidden state of each sequence's first token (the CLS token)

4.MeanPool(PoolingType.MEAN)
```
Copilot AI (Dec 1, 2025)
Inconsistent formatting: Items 1 and 2 use "1." and "2." numbering, while the pattern should continue with "3." and "4." for consistency. The formatting shows "3.CLSPool" and "4.MeanPool" without proper spacing.
Suggested change:

```diff
-3.CLSPool(PoolingType.CLS)
-Purpose: returns the hidden state of each sequence's first token (the CLS token)
-4.MeanPool(PoolingType.MEAN)
+3. CLSPool(PoolingType.CLS)
+Purpose: returns the hidden state of each sequence's first token (the CLS token)
+4. MeanPool(PoolingType.MEAN)
```
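Since several threads here reference these pooling types, a minimal sketch of what each one computes over a single sequence's hidden states (illustrative only; the real implementations in fastdeploy/model_executor/layers/pooler.py handle batching and pooling metadata):

```python
import paddle

def pool(hidden_states: paddle.Tensor, pooling_type: str) -> paddle.Tensor:
    """Reduce [seq_len, hidden_size] hidden states to one [hidden_size] vector."""
    if pooling_type == "LAST":  # LastPool: hidden state of the final token
        return hidden_states[-1]
    if pooling_type == "CLS":   # CLSPool: hidden state of the first (CLS) token
        return hidden_states[0]
    if pooling_type == "MEAN":  # MeanPool: average over all tokens
        return hidden_states.mean(axis=0)
    raise ValueError(f"unknown pooling type: {pooling_type}")
```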
```
FastDeploy的OpenAI兼容服务器提供了API的端点和自定义的reward接口

- `Embeddings API`,支持文本和多模态输入
- `Reward API`,给指定的内容打分
```
Copilot AI (Dec 1, 2025)
Grammar and formatting issue: "API的端点" should be "API端点" (redundant 的). Also, the list should use proper bullet points format with dashes, and there should be proper spacing after colons in Chinese text.
```python
_start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")


@pytest.fixture(scope="function")
def server_no_caching():
    _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
```
Copilot AI (Dec 1, 2025)
The fixture server_default_caching starts the server but doesn't yield the process or perform cleanup. The server process should be yielded and properly terminated in a teardown section to avoid leaving orphaned processes.

Consider:

```python
@pytest.fixture(scope="function")
def server_default_caching():
    process = _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
    yield process
    try:
        os.killpg(process.pid, signal.SIGTERM)
    except Exception:
        pass
    clean_ports()
```

Suggested change:

```diff
 @pytest.fixture(scope="function")
 def server_default_caching():
-    _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
+    process = _start_server_process(enable_caching=True, log_filename="reward_server_caching_on.log")
+    try:
+        yield process
+    finally:
+        try:
+            os.killpg(process.pid, signal.SIGTERM)
+        except Exception:
+            pass
+        clean_ports()

 @pytest.fixture(scope="function")
 def server_no_caching():
-    _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
+    process = _start_server_process(enable_caching=False, log_filename="reward_server_caching_off.log")
+    try:
+        yield process
+    finally:
+        try:
+            os.killpg(process.pid, signal.SIGTERM)
+        except Exception:
+            pass
+        clean_ports()
```
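Of the two variants, the try/finally form is slightly more robust as a general Python matter: if the fixture generator is ever closed early (GeneratorExit, e.g. on a hard interrupt), code placed after a bare yield is skipped, whereas a finally block still runs the SIGTERM and clean_ports() teardown.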
```diff
 def forward(
     self,
-    hidden_states: Union[paddle.Tensor, list[paddle.Tensor]],
+    hidden_states: paddle.Tensor | list[paddle.Tensor],
```
Copilot AI (Dec 1, 2025)
The type annotation uses the newer | syntax (paddle.Tensor | list[paddle.Tensor]), but earlier in the same file (e.g., line 435) uses the older Union syntax. This is inconsistent. Consider using Union[paddle.Tensor, list[paddle.Tensor]] for consistency with the rest of the codebase.
Suggested change:

```diff
-hidden_states: paddle.Tensor | list[paddle.Tensor],
+hidden_states: Union[paddle.Tensor, list[paddle.Tensor]],
```
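Beyond style consistency, there is a compatibility angle worth noting (general Python behavior, not something this PR states): the `X | Y` annotation syntax from PEP 604 requires Python 3.10+ at runtime unless annotation evaluation is deferred:

```python
# With this future import (Python 3.7+), annotations are not evaluated at
# runtime, so "paddle.Tensor | list[paddle.Tensor]" parses fine even on 3.8/3.9.
from __future__ import annotations
```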
```
!!! 提示<br>
    You can explicitly set `--convert <type>` to specify how the model is converted.
```
Copilot AI (Dec 1, 2025)
The admonition syntax is incorrect. The exclamation marks should not have a space between them and "提示". It should be !!! 提示 without a trailing <br> tag.
```
!!! 提示<br>
    在绝大多数情况下无需手动设置该选项,因此Fastdeploy可以通过--runner auto(默认值)自动检测合适的runner。
```
Copilot AI (Dec 1, 2025)
The admonition syntax is incorrect (should be !!! 提示 without <br>), and "Fastdeploy" should be "FastDeploy" for consistent capitalization. Also, there's an extra comma before "因为".
Suggested change:

```diff
-!!! 提示<br>
-在绝大多数情况下无需手动设置该选项,因此Fastdeploy可以通过--runner auto(默认值)自动检测合适的runner。
+!!! 提示
+    在绝大多数情况下无需手动设置该选项,因此FastDeploy可以通过--runner auto(默认值)自动检测合适的runner。
```
```
| Architecture | `--convert` | Supported pooling types |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` `**ForProcessRewardModel` | `embed` | `embed` |
```
Copilot AI (Dec 1, 2025)
Missing comma between *Model and **ForProcessRewardModel. It should be a comma-separated list. Also, **ForProcessRewardModel should be *ForProcessRewardModel (single asterisk for consistency).
Suggested change:

```diff
-| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` `**ForProcessRewardModel` | `embed` | `embed` |
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |
```
```python
chat_template_kwargs.update(
    {
        "chat_template": request_dict.get("chat_template"),
```
Copilot AI (Dec 1, 2025)
The add_generation_prompt parameter has been removed from the chat_template_kwargs dictionary. However, this change doesn't check if the parameter is still needed elsewhere or provide any fallback. If this parameter was being used by chat templates, this could break existing functionality. Please ensure that removing this parameter doesn't break any existing chat template implementations or add a comment explaining why it's safe to remove.
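For context on what this parameter does, a sketch using the Hugging Face transformers chat-template API (transformers is not part of this PR, and the model name is a placeholder). With add_generation_prompt=True the template appends an assistant header to prime generation, which a pooling/reward request scoring existing content presumably does not want:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
messages = [{"role": "user", "content": "Hi"}]

# True: renders the conversation plus an assistant header, priming a reply.
# False: renders only the conversation as-is, suitable for scoring/pooling.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))
```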
```python
print("[Server Setup] Server failed to start. Cleaning up...")
try:
    os.killpg(process.pid, signal.SIGTERM)
except Exception:
```
Copilot AI (Dec 1, 2025)
'except' clause does nothing but pass and there is no explanatory comment.
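A minimal way to satisfy that lint, keeping the deliberate swallow but documenting it (same names as the snippet above; surrounding imports assumed):

```python
try:
    os.killpg(process.pid, signal.SIGTERM)
except Exception:
    # Best-effort cleanup: the process group may already have exited, or we may
    # lack permission to signal it; either way, teardown should continue.
    pass
```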
fastdeploy/engine/request.py
Outdated
```python
if pooling_params is not None:
    self.enable_thinking = False
else:
    self.enable_thinking = enable_thinking
```
This logic feels very model-specific; I'd suggest solving it by changing the chat template inside the model rather than hardcoding it here.
fastdeploy/engine/request.py
Outdated
```python
enable_thinking = d.get("enable_thinking", None)

if pooling_params is not None:
    enable_thinking = False
```
Same as above.
There are still open questions around how enable_thinking should be set: not every model uses enable_thinking as the chat-template variable that toggles thinking on and off, and in the current version it is passed in through chat_template_kwargs. This may be revised later.
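To illustrate the direction mentioned above, a hypothetical request payload that toggles thinking through chat_template_kwargs instead of a hardcoded flag (field names are assumptions based on this discussion, not a confirmed FastDeploy schema):

```python
# Hypothetical: the serving layer forwards chat_template_kwargs to the chat
# template, so a pooling/reward request can disable thinking explicitly.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Score this answer."}],
    "chat_template_kwargs": {"enable_thinking": False},
}
```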
```python
if self.is_pooling_model:
    rope_3d_position_ids["max_tokens_lst"].append(0)
else:
    rope_3d_position_ids["max_tokens_lst"].append(request.get("max_tokens", 2048))
```
Can a request ever arrive without max_tokens? Writing a default value here worries me; it could make later debugging harder.
…into support_pooling_6
Motivation
Support reward models.
Modifications
Support reward models.
Usage or Command
How to launch the service:
How to send requests:
This PR only supports single-batch inference.
TODO: support multi-batch inference for embedding models.
Accuracy Tests
Checklist

- Use at least one of the tags in the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a PR targeting the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the [Cherry-Pick] PR tag.