Commit 255e4ea

[None][doc] Update DS-R1 example doc (#9231)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Parent: 67d3eb2


examples/models/core/deepseek_v3/README.md

Lines changed: 20 additions & 2 deletions
@@ -247,6 +247,7 @@ cuda_graph_config:
   max_batch_size: 1024
 enable_attention_dp: false
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
 stream_interval: 10
 EOF
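
For reference, this hunk's config assembles to the following file once the YAML nesting is restored. This is a reconstruction from the hunk's context lines only: the `cat >./extra-llm-api-config.yml <<EOF` opener follows the pattern of the later hunks and is an assumption, and any keys above `max_batch_size` sit outside the hunk. The hunks below add the same `enable_block_reuse: false` line to the other config variants.

```bash
# Reconstruction from the hunk context; the heredoc opener is assumed
# from the other hunks in this commit.
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  # (keys above this hunk are unchanged and omitted)
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
stream_interval: 10
EOF
```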
@@ -258,22 +259,34 @@ cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
+    - 2048
     - 1024
     - 896
     - 512
+    - 384
     - 256
+    - 192
+    - 160
     - 128
+    - 96
     - 64
+    - 48
     - 32
+    - 24
     - 16
     - 8
     - 4
     - 2
     - 1
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
 stream_interval: 10
 enable_attention_dp: true
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
 EOF
 ```

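This hunk both widens the CUDA-graph capture list (adding sizes 2048, 384, 192, 160, 96, 48, and 24) and introduces an `attention_dp_config` block. Judging by the field names alone, that block tunes how attention data parallelism batches work across ranks: how many iterations to wait when batching, whether to balance load, and an iteration timeout. Treat that reading as an inference from the names, not something this commit states. Since the hunk spans the whole heredoc, the updated file can be assembled verbatim:

```bash
# Assembled result of the hunk above; every line comes from the diff,
# only the YAML indentation is restored.
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  batch_sizes:
    - 2048
    - 1024
    - 896
    - 512
    - 384
    - 256
    - 192
    - 160
    - 128
    - 96
    - 64
    - 48
    - 32
    - 24
    - 16
    - 8
    - 4
    - 2
    - 1
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
stream_interval: 10
enable_attention_dp: true
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60
EOF
```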
@@ -285,6 +298,7 @@ cuda_graph_config:
   max_batch_size: 1024
 enable_attention_dp: false
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
   free_gpu_memory_fraction: 0.8
 stream_interval: 10
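
This hunk mirrors the first one in a config that also caps KV-cache memory. Assembled, with the same caveats (heredoc opener assumed, lines outside the hunk omitted), it reads:

```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  # (keys above this hunk are unchanged and omitted)
  max_batch_size: 1024
enable_attention_dp: false
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
# (any keys below this hunk are unchanged and omitted)
EOF
```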
@@ -301,7 +315,12 @@ cuda_graph_config:
   enable_padding: true
   max_batch_size: 512
 enable_attention_dp: true
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
   free_gpu_memory_fraction: 0.8
 stream_interval: 10
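
Assembled with the same caveats, this hunk's attention-DP variant of the config reads:

```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  enable_padding: true
  max_batch_size: 512
enable_attention_dp: true
attention_dp_config:
  batching_wait_iters: 0
  enable_balance: true
  timeout_iters: 60
kv_cache_config:
  enable_block_reuse: false
  dtype: fp8
  free_gpu_memory_fraction: 0.8
stream_interval: 10
# (any keys below this hunk are unchanged and omitted)
EOF
```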
@@ -316,12 +335,11 @@ trtllm-serve \
   --host localhost \
   --port 8000 \
   --backend pytorch \
-  --max_batch_size 1024 \
+  --max_batch_size 2048 \
   --max_num_tokens 8192 \
   --tp_size 8 \
   --ep_size 8 \
   --pp_size 1 \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
   --extra_llm_api_options ./extra-llm-api-config.yml
 ```
 You may see OOM issues with some configs. Consider reducing `kv_cache_free_gpu_memory_fraction` to a smaller value as a workaround; we are investigating and addressing the problem. If you are using the max-throughput config, reduce `max_num_tokens` to `3072` to avoid OOM issues.
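
Assembled from the last hunk, the updated serve command is below. The positional model argument sits above the hunk, so `${MODEL_PATH}` is a placeholder, not the README's actual value. Note the explicit `--kv_cache_free_gpu_memory_fraction 0.9` flag is dropped; presumably the fraction is now left to `kv_cache_config.free_gpu_memory_fraction` in the extra options file, or to the default where a config does not set it.

```bash
# ${MODEL_PATH} is a placeholder for the positional model argument,
# which is outside this hunk and unchanged by this commit.
trtllm-serve \
  "${MODEL_PATH}" \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 2048 \
  --max_num_tokens 8192 \
  --tp_size 8 \
  --ep_size 8 \
  --pp_size 1 \
  --extra_llm_api_options ./extra-llm-api-config.yml
```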
