`examples/models/core/deepseek_v3` — 1 file changed, +20 −2 lines
````diff
@@ -247,6 +247,7 @@ cuda_graph_config:
   max_batch_size: 1024
 enable_attention_dp: false
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
 stream_interval: 10
 EOF
````
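The hunk above sets `enable_block_reuse: false`, turning off KV-cache block reuse, where blocks holding an already-seen prompt prefix are shared across requests instead of being recomputed in prefill. A toy Python sketch of prefix-block reuse (a hypothetical structure, not the real TensorRT-LLM KV-cache manager) illustrates the mechanism being disabled; with reuse on, a repeated prompt would skip prefill for its cached blocks, which is likely why benchmark configs disable it:

```python
from hashlib import sha256

BLOCK_TOKENS = 64  # tokens per KV-cache block (illustrative size)

class ToyBlockCache:
    """Toy prefix-block cache: a block is reusable only if every token
    up to and including that block matches a previously seen request."""

    def __init__(self, enable_block_reuse: bool = True):
        self.enable_block_reuse = enable_block_reuse
        self.blocks = {}  # prefix hash -> cached block id

    def _key(self, tokens, end):
        return sha256(str(tokens[:end]).encode()).hexdigest()

    def insert(self, tokens):
        for start in range(0, len(tokens) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
            key = self._key(tokens, start + BLOCK_TOKENS)
            self.blocks.setdefault(key, len(self.blocks))

    def lookup(self, tokens):
        """Return how many leading tokens can skip prefill via reuse."""
        if not self.enable_block_reuse:
            return 0
        reused = 0
        for start in range(0, len(tokens) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
            if self._key(tokens, start + BLOCK_TOKENS) not in self.blocks:
                break
            reused = start + BLOCK_TOKENS
        return reused

cache = ToyBlockCache(enable_block_reuse=True)
cache.insert(list(range(200)))
print(cache.lookup(list(range(200))))  # 192: three full blocks reused
print(ToyBlockCache(enable_block_reuse=False).lookup(list(range(200))))  # 0
```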
````diff
@@ -258,22 +259,34 @@ cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
   enable_padding: true
   batch_sizes:
+  - 2048
   - 1024
   - 896
   - 512
+  - 384
   - 256
+  - 192
+  - 160
   - 128
+  - 96
   - 64
+  - 48
   - 32
+  - 24
   - 16
   - 8
   - 4
   - 2
   - 1
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
 stream_interval: 10
 enable_attention_dp: true
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
 EOF
 ```
 
````
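With `enable_padding: true`, a runtime batch whose size falls between two entries in `batch_sizes` is padded up to the next captured size so a pre-built CUDA graph can be replayed instead of falling back to eager execution. A minimal Python sketch of that bucketing (illustrative only; `padded_batch_size` is not a TensorRT-LLM API):

```python
import bisect

# Batch sizes for which CUDA graphs are captured, from the config above.
CUDA_GRAPH_BATCH_SIZES = [1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128,
                          160, 192, 256, 384, 512, 896, 1024, 2048]

def padded_batch_size(actual: int) -> int:
    """Pad a runtime batch up to the nearest captured CUDA-graph size."""
    idx = bisect.bisect_left(CUDA_GRAPH_BATCH_SIZES, actual)
    if idx == len(CUDA_GRAPH_BATCH_SIZES):
        raise ValueError(f"batch {actual} exceeds the largest captured size")
    return CUDA_GRAPH_BATCH_SIZES[idx]

print(padded_batch_size(20))    # 24
print(padded_batch_size(300))   # 384
print(padded_batch_size(2048))  # 2048
```

The finer spacing added between 16 and 512 (24, 48, 96, 160, 192, 384) reduces how much padding is wasted on mid-sized batches.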
````diff
@@ -285,6 +298,7 @@ cuda_graph_config:
   max_batch_size: 1024
 enable_attention_dp: false
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
   free_gpu_memory_fraction: 0.8
 stream_interval: 10
````
````diff
@@ -301,7 +315,12 @@ cuda_graph_config:
   enable_padding: true
   max_batch_size: 512
 enable_attention_dp: true
+attention_dp_config:
+  batching_wait_iters: 0
+  enable_balance: true
+  timeout_iters: 60
 kv_cache_config:
+  enable_block_reuse: false
   dtype: fp8
   free_gpu_memory_fraction: 0.8
 stream_interval: 10
````
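The new `attention_dp_config` block tunes how attention data parallelism balances work across ranks when `enable_balance: true`. A hypothetical sketch of how a gate could combine `batching_wait_iters` (minimum iterations to accumulate requests) and `timeout_iters` (maximum iterations to wait for all ranks to have work); this is illustrative only, not TensorRT-LLM's actual scheduler:

```python
def should_dispatch(waited_iters: int, all_ranks_busy: bool,
                    batching_wait_iters: int = 0,
                    timeout_iters: int = 60) -> bool:
    """Hypothetical balance gate: wait at least batching_wait_iters to
    accumulate a batch, dispatch immediately once every rank has work,
    and stop waiting for balance after timeout_iters iterations."""
    if waited_iters < batching_wait_iters:
        return False
    if all_ranks_busy:
        return True
    return waited_iters >= timeout_iters

# With the values above (batching_wait_iters: 0, timeout_iters: 60):
print(should_dispatch(0, True))    # True: balanced, dispatch now
print(should_dispatch(10, False))  # False: keep waiting for balance
print(should_dispatch(60, False))  # True: timed out, dispatch anyway
```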
````diff
@@ -316,12 +335,11 @@ trtllm-serve \
   --host localhost \
   --port 8000 \
   --backend pytorch \
-  --max_batch_size 1024 \
+  --max_batch_size 2048 \
   --max_num_tokens 8192 \
   --tp_size 8 \
   --ep_size 8 \
   --pp_size 1 \
-  --kv_cache_free_gpu_memory_fraction 0.9 \
   --extra_llm_api_options ./extra-llm-api-config.yml
 ```
````
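`trtllm-serve` exposes an OpenAI-compatible HTTP API on the configured host and port. A quick smoke test against it might look like the following sketch; the model name is a placeholder, so substitute whatever you passed to `trtllm-serve`:

```python
import json
import urllib.request

# Placeholder model name: match the model you launched trtllm-serve with.
payload = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is up; kept commented so the snippet is
# runnable without a live endpoint.
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```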
You may see OOM issues with some configs. Consider reducing `kv_cache_free_gpu_mem_fraction` to a smaller value as a workaround; we are investigating and addressing the problem. If you are using the max-throughput config, reduce `max_num_tokens` to `3072` to avoid OOM issues.