Commit ebdd1cc

[TRTLLM-8119][feat] Update doc/tests/chat_template for nano-v2-vlm (#8840)
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
1 parent 20fd305 commit ebdd1cc

File tree: 12 files changed, +595 -258 lines changed

Lines changed: 84 additions & 0 deletions
# Nemotron-nano-v2-VL

## Model series

* https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
* https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8
* https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD

## Support Matrix

* BF16 / FP8 / FP4
* Tensor Parallel / Pipeline Parallel
* Inflight Batching
* PAGED_KV_CACHE
* Checkpoint type: Hugging Face (HF)
* Image / video multimodal input

## Offline batch inference example commands

The commands below use the BF16 model as an example; swap in the FP8 / FP4 checkpoint as needed.

* Image modality input:

```bash
python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code
```

* Image modality input with chunked prefill:

```bash
python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code --enable_chunked_prefill --max_num_tokens=256
```

* Video modality input:

```bash
python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code --modality video --max_num_tokens 131072
```

* Video modality input with Efficient Video Sampling (EVS); a ratio-sweep sketch follows this list:

```bash
TLLM_VIDEO_PRUNING_RATIO=0.9 python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code --modality video --max_num_tokens 131072
```
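
`TLLM_VIDEO_PRUNING_RATIO` controls how aggressively EVS prunes video tokens; a value of 0.9 presumably drops about 90% of them. To gauge the speed/quality trade-off, you can sweep the ratio. This is only a sketch: it is the video command above run in a loop, and the ratio values are illustrative.

```bash
# Sweep the EVS pruning ratio to compare latency and output quality.
# Ratio values are illustrative; higher values prune more video tokens.
for ratio in 0.5 0.75 0.9; do
  echo "=== TLLM_VIDEO_PRUNING_RATIO=${ratio} ==="
  TLLM_VIDEO_PRUNING_RATIO=${ratio} python3 examples/llm-api/quickstart_multimodal.py \
    --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
    --disable_kv_cache_reuse \
    --max_batch_size 128 \
    --trust_remote_code \
    --modality video \
    --max_num_tokens 131072
done
```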

## Online serving example commands

The commands below use the BF16 model as an example; swap in the FP8 / FP4 checkpoint as needed. The extra config file disables KV block reuse (prefix caching is not yet supported for this model; see Known issues) and keeps the Mamba SSM cache in float32.

```bash
# Create the extra LLM API config file.
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false
  mamba_ssm_cache_dtype: float32
EOF

# Launch the server without EVS.
trtllm-serve \
  nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 16 \
  --max_num_tokens 131072 \
  --trust_remote_code \
  --media_io_kwargs "{\"video\": {\"fps\": 2, \"num_frames\": 128} }" \
  --extra_llm_api_options extra-llm-api-config.yml

# Launch the server with EVS (video_pruning_ratio=0.9).
TLLM_VIDEO_PRUNING_RATIO=0.9 trtllm-serve \
  nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 16 \
  --max_num_tokens 131072 \
  --trust_remote_code \
  --media_io_kwargs "{\"video\": {\"fps\": 2, \"num_frames\": 128} }" \
  --extra_llm_api_options extra-llm-api-config.yml
```
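
Once the server is up, it exposes an OpenAI-compatible HTTP API. The request below is a minimal sketch: the endpoint is the standard `/v1/chat/completions`, while the prompt text and image URL are placeholders.

```bash
# Minimal OpenAI-compatible request with an image input.
# The image URL is a placeholder; replace it with your own media.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'
```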

## Known issues

* Don't set the batch size too large; otherwise the Mamba cache may cause an out-of-memory (OOM) error (a retry sketch follows this list).
* Chunked prefill is not supported for the video modality yet.
* Prefix caching is not supported for Nemotron-nano-v2-VL yet.
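
If you do hit a Mamba-cache OOM, reducing `--max_batch_size` is the first knob to turn. This is the image command from above with a smaller, illustrative batch size.

```bash
# Retry with a smaller batch size if the Mamba cache runs out of memory.
# The value 16 is illustrative; tune it to your GPU memory.
python3 examples/llm-api/quickstart_multimodal.py \
  --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
  --disable_kv_cache_reuse \
  --max_batch_size 16 \
  --trust_remote_code
```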

examples/models/core/nemotron/README.md renamed to examples/models/core/nemotron/README_nemotron-3.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Nemotron
+# Nemotron-3
 
 This document demonstrates how to build the Nemotron models using TensorRT LLM and run on a single GPU or multiple GPUs.
 

tensorrt_llm/_torch/models/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -15,9 +15,9 @@
 from .modeling_llava_next import LlavaNextModel
 from .modeling_mistral import Mistral3VLM, MistralForCausalLM
 from .modeling_mixtral import MixtralForCausalLM
-from .modeling_nanov2vlm import NemotronH_Nano_VL_V2
 from .modeling_nemotron import NemotronForCausalLM
 from .modeling_nemotron_h import NemotronHForCausalLM
+from .modeling_nemotron_nano import NemotronH_Nano_VL_V2
 from .modeling_nemotron_nas import NemotronNASForCausalLM
 from .modeling_phi3 import Phi3ForCausalLM
 from .modeling_phi4mm import Phi4MMForCausalLM
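
The hunk above moves the `NemotronH_Nano_VL_V2` export from `modeling_nanov2vlm` to the renamed `modeling_nemotron_nano` module while keeping the public class name. A minimal import sanity check, assuming a TensorRT LLM build that includes this commit:

```bash
# Verify the renamed module path; should print
# tensorrt_llm._torch.models.modeling_nemotron_nano
python3 -c "from tensorrt_llm._torch.models import NemotronH_Nano_VL_V2; print(NemotronH_Nano_VL_V2.__module__)"
```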
