Merged
31 commits
16b9399
Your commit message here
lizexu123 Nov 25, 2025
d8a0f52
add test
lizexu123 Nov 25, 2025
1c2a228
Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …
lizexu123 Nov 25, 2025
63df44b
update develop
lizexu123 Nov 25, 2025
7cca89f
support reward
lizexu123 Nov 25, 2025
a0c81e3
update develop
lizexu123 Nov 25, 2025
af3b93b
support enable_chunk_prefill
lizexu123 Nov 25, 2025
fa45a91
support bingfa
lizexu123 Nov 26, 2025
766e58b
Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …
lizexu123 Nov 27, 2025
8868d8e
support convert is reward
lizexu123 Nov 27, 2025
3a289c9
update test
lizexu123 Nov 27, 2025
921e04d
delete print
lizexu123 Nov 27, 2025
0a07749
fix enable_thinking
lizexu123 Nov 27, 2025
dd6cb23
add document
lizexu123 Nov 27, 2025
3df2899
fix place
lizexu123 Nov 27, 2025
8023a66
fix test
lizexu123 Nov 27, 2025
d0c4151
fix
lizexu123 Nov 27, 2025
a775f9a
Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …
lizexu123 Nov 27, 2025
ab0b6aa
support enable_prefix_caching
lizexu123 Nov 28, 2025
b777746
add no-enable_prefix-caching test
lizexu123 Nov 28, 2025
c83a9e0
fix
lizexu123 Nov 28, 2025
0751d21
support enable_prefix_caching
lizexu123 Nov 28, 2025
0b692d0
delete print
lizexu123 Nov 30, 2025
6f1b431
fix document
lizexu123 Nov 30, 2025
4d7f4fb
fix
lizexu123 Nov 30, 2025
817ede4
fix test
lizexu123 Dec 1, 2025
6f59f5b
fix document and delete chinese
lizexu123 Dec 1, 2025
243e6c1
udpate
lizexu123 Dec 1, 2025
87d4d45
enable_thinking
lizexu123 Dec 1, 2025
4a8bf03
fix test
lizexu123 Dec 2, 2025
778feca
Merge branch 'develop' of https://github.com/PaddlePaddle/FastDeploy …
lizexu123 Dec 2, 2025
175 changes: 175 additions & 0 deletions docs/features/pooling_models.md
@@ -0,0 +1,175 @@
[简体中文](../zh/features/pooling_models.md)

# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface.
These models use a `Pooler` to extract the final hidden states of the input
before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    In the vast majority of cases there is no need to set this option, as FastDeploy can automatically
    detect the appropriate model runner via `--runner auto`.

### Model Conversion

FastDeploy can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the
`FdModelForPooling` interface,
FastDeploy will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ ||

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
their Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults. The pooling type can also be specified during model construction via the `@default_pooling_type("LAST")` decorator.
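As a rough illustration (not FastDeploy's actual code), the `embed` defaults from the table above (`LAST` pooling followed by L2 normalization) amount to:

```python
import numpy as np

def embed_default(hidden_states: np.ndarray) -> np.ndarray:
    """hidden_states: (seq_len, hidden_dim) array of per-token states."""
    pooled = hidden_states[-1]              # LAST: take the final token's state
    return pooled / np.linalg.norm(pooled)  # Normalization: scale to unit L2 norm
```

The returned vector always has unit length, which is the usual convention for cosine-similarity search over embeddings.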

##### Pooling Type

1. LastPool (`PoolingType.LAST`)

   Purpose: extracts the hidden state of the last token in each sequence.

2. AllPool (`PoolingType.ALL`)

   Purpose: returns the hidden states of all tokens in each sequence.

3. CLSPool (`PoolingType.CLS`)

   Purpose: returns the hidden state of the first token (the CLS token) in each sequence.

4. MeanPool (`PoolingType.MEAN`)

   Purpose: computes the average of all token hidden states in each sequence.
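The four pooling types can be sketched as follows (illustrative only, not FastDeploy's implementation):

```python
import numpy as np

def pool(hidden_states: np.ndarray, pooling_type: str) -> np.ndarray:
    """hidden_states: (seq_len, hidden_dim) array of valid token states."""
    if pooling_type == "LAST":  # hidden state of the last token
        return hidden_states[-1]
    if pooling_type == "CLS":   # hidden state of the first (CLS) token
        return hidden_states[0]
    if pooling_type == "MEAN":  # average of all token hidden states
        return hidden_states.mean(axis=0)
    if pooling_type == "ALL":   # all token hidden states, unreduced
        return hidden_states
    raise ValueError(f"Unknown pooling type: {pooling_type}")
```

Note that `LAST`, `CLS`, and `MEAN` reduce a sequence to a single vector, while `ALL` keeps one vector per token.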

## Online Serving

FastDeploy's OpenAI-compatible server provides the following API endpoints, including a custom reward interface:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specified content
### Embedding Model
```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
--max-num-seqs 256 --max-model-len 32768 \
--port 9412 --engine-worker-queue-port 7142 \
--metrics-port 7211 --tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--runner pooling
```

Request Methods:
A. EmbeddingCompletionRequest Example (Standard Text Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"model": "text-embedding-chat-model",
"input": [
"This is a sentence for pooling embedding.",
"Another input text."
],
"user": "test_client"
}'
```

B. EmbeddingChatRequest Example (Message Sequence Input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"model": "text-embedding-chat-model",
"messages": [
{"role": "user", "content": "Generate embedding for user query."}
]
}'
```
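The same requests can also be issued from Python using only the standard library. The helper below is illustrative (not a FastDeploy API), and the host/port assume the server launched above:

```python
import json
from urllib import request

def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = {
    "model": "text-embedding-chat-model",
    "input": ["This is a sentence for pooling embedding."],
    "user": "test_client",
}
# result = post_json("http://localhost:9412/v1/embeddings", payload)
# Each item in result["data"] carries an "embedding" vector.
```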

### Pooling Model with Reward Scoring
```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
--model ${model_path} \
--max-num-seqs 256 \
--max-model-len 8192 \
--port 13351 \
--engine-worker-queue-port 7562 \
--metrics-port 7531 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--runner pooling \
--convert embed
```
Request Method: ChatRewardRequest
```bash
curl --location 'http://xxxx/v1/chat/reward' \
--header 'Content-Type: application/json' \
--data '{
"model": "",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://xxx/a.png"
}
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "图里有几个人"
}
]
}
],
"user": "user-123",
"chat_template": null,
"chat_template_kwargs": {
"custom_var": "value"
},
"mm_processor_kwargs": {
"image_size": 224
}
}'
```
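A `ChatRewardRequest` body can likewise be assembled in Python before posting it to `/v1/chat/reward`. The helper name and message contents below are placeholders, not part of FastDeploy:

```python
from typing import List, Optional

def build_reward_request(model: str, messages: List[dict], user: Optional[str] = None) -> dict:
    """Assemble a ChatRewardRequest body for POST /v1/chat/reward."""
    body = {"model": model, "messages": messages}
    if user is not None:
        body["user"] = user
    return body

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Question text here."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Candidate answer to score."}]},
]
body = build_reward_request("", messages, user="user-123")
# POST this JSON body to http://<host>/v1/chat/reward as in the curl example above.
```

Optional fields such as `chat_template`, `chat_template_kwargs`, and `mm_processor_kwargs` can be added to the body in the same way when needed.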
168 changes: 168 additions & 0 deletions docs/zh/features/pooling_models.md
@@ -0,0 +1,168 @@
[English](../../features/pooling_models.md)

# Pooling Models

FastDeploy also supports pooling models, such as embedding models.

In FastDeploy, pooling models implement the `FdModelForPooling` interface. These models use a `Pooler` to extract the final hidden states of the input before returning them.

## Configuration

### Model Runner

Run a model in pooling mode via the `--runner pooling` option.

!!! tip
    In the vast majority of cases there is no need to set this option manually, as FastDeploy can automatically detect the appropriate runner via `--runner auto` (the default).

### Model Conversion

If the model does not implement the `FdModelForPooling` interface but you want to run it in pooling mode, FastDeploy can automatically convert the model via the `--convert <type>` option.

When `--runner pooling` has been set (manually or automatically) but the model does not implement the interface, FastDeploy converts it automatically according to the architecture names shown in the table below:

| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|---------------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`, `*ForProcessRewardModel` | `embed` | `embed` |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.

### Pooler Configuration

#### Predefined models

If the `Pooler` defined by the model accepts `pooler_config`, you can override some of its attributes via the `--pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above), the pooler assigned to each task has the following attributes by default:

| Task | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `embed` | `LAST` | ✅︎ ||

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models, their `modules.json` configuration file takes priority over the model's defaults. The pooling type can also be specified during model construction via the `@default_pooling_type("LAST")` decorator.

#### Pooling Type

1. LastPool (`PoolingType.LAST`)

   Purpose: extracts the hidden state of the last token in each sequence.

2. AllPool (`PoolingType.ALL`)

   Purpose: returns the hidden states of all tokens in each sequence.

3. CLSPool (`PoolingType.CLS`)

   Purpose: returns the hidden state of the first token (the CLS token) in each sequence.

4. MeanPool (`PoolingType.MEAN`)

   Purpose: computes the average of all token hidden states in each sequence.

## Online Serving

FastDeploy's OpenAI-compatible server provides the following API endpoints, including a custom reward interface:

- Embeddings API: supports text and multi-modal inputs
- Reward API: scores specified content

### Embedding Model
```bash
model_path=Qwen/Qwen3-Embedding-0.6B

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
--max-num-seqs 256 --max-model-len 32768 \
--port 9412 --engine-worker-queue-port 7142 \
--metrics-port 7211 --tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
    --runner pooling
```

Request Methods:
A. EmbeddingCompletionRequest Example (standard text input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"model": "text-embedding-chat-model",
"input": [
"This is a sentence for pooling embedding.",
"Another input text."
],
"user": "test_client"
}'
```

B. EmbeddingChatRequest Example (message sequence input)

```bash
curl -X POST 'YOUR_SERVICE_URL/v1/embeddings' \
-H 'Content-Type: application/json' \
-d '{
"model": "text-embedding-chat-model",
"messages": [
{"role": "user", "content": "Generate embedding for user query."}
]
}'
```

### Pooling Model with Reward Scoring
```bash
model_path=RM_v1008
python -m fastdeploy.entrypoints.openai.api_server \
--model ${model_path} \
--max-num-seqs 256 \
--max-model-len 8192 \
--port 13351 \
--engine-worker-queue-port 7562 \
--metrics-port 7531 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.9 \
--runner pooling \
    --convert embed
```

Request Method: ChatRewardRequest

```bash
curl --location 'http://xxxx/v1/chat/reward' \
--header 'Content-Type: application/json' \
--data '{
"model": "",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://xxx/a.png"
}
}
]
},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": "图里有几个人"
}
]
}
],
"user": "user-123",
"chat_template": null,
"chat_template_kwargs": {
"custom_var": "value"
},
"mm_processor_kwargs": {
"image_size": 224
}
}'
```
2 changes: 1 addition & 1 deletion fastdeploy/config.py
@@ -292,7 +292,7 @@ def override_name_from_config(self):
self.tensor_parallel_size = self.infer_model_mp_num
del self.infer_model_mp_num

if hasattr(self, "num_hidden_layers"):
if hasattr(self, "num_hidden_layers") and self.runner != "pooling":
if hasattr(self, "remove_tail_layer"):
if self.remove_tail_layer is True:
self.num_hidden_layers -= 1
2 changes: 1 addition & 1 deletion fastdeploy/engine/pooling_params.py
@@ -164,7 +164,7 @@ def _set_default_parameters(self, model_config: Optional["ModelConfig"]):
self.softmax = True
elif self.task == "reward":
if self.normalize is None:
self.normalize = True
self.normalize = False
else:
raise ValueError(f"Unknown pooling task: {self.task}")

2 changes: 1 addition & 1 deletion fastdeploy/engine/request.py
@@ -192,6 +192,7 @@ def from_dict(cls, d: dict):
pooling_params = PoolingParams.from_dict(d["pooling_params"])
else:
sampling_params = SamplingParams.from_dict(d)

if (
isinstance(d.get("multimodal_inputs"), dict)
and isinstance(d["multimodal_inputs"].get("mm_positions"), list)
@@ -207,7 +208,6 @@ def from_dict(cls, d: dict):
data_processor_logger.error(
f"Convert mm_positions to ImagePosition error: {e}, {str(traceback.format_exc())}"
)

return cls(
request_id=d["request_id"],
prompt=d.get("prompt"),