# Nemotron-nano-v2-VL

## Model series
 * https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16
 * https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8
 * https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-NVFP4-QAD

## Support Matrix
 * BF16 / FP8 / FP4
 * Tensor Parallel / Pipeline Parallel (see the serving sketch after this list)
 * Inflight Batching
 * PAGED_KV_CACHE
 * Checkpoint type: Hugging Face (HF)
 * Image / video multimodal input
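As a minimal sketch of multi-GPU serving (treat the exact flags as an assumption and verify with `trtllm-serve --help` in your installation), the FP8 checkpoint could be served with tensor parallelism across two GPUs:

```bash
# Hypothetical sketch: serve the FP8 checkpoint with tensor parallelism
# across 2 GPUs; adjust --tp_size to your GPU count.
trtllm-serve \
  nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --backend pytorch \
  --tp_size 2 \
  --trust_remote_code
```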

## Offline batch inference example CMDs
 * The commands below use the BF16 checkpoint as an example; you can swap in the FP8 or NVFP4 checkpoint instead.

 * Image modality input:

```bash
python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code
```
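The quickstart script runs with built-in sample inputs by default. Recent versions of `quickstart_multimodal.py` also accept `--prompt` and `--media` arguments (treat these flag names as an assumption and confirm with `--help`), so a run against your own image might look like:

```bash
# Hypothetical sketch: supply a custom prompt and a local image.
# The --prompt / --media flag names are assumed; verify with --help.
python3 examples/llm-api/quickstart_multimodal.py \
  --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
  --disable_kv_cache_reuse \
  --max_batch_size 128 \
  --trust_remote_code \
  --prompt "Describe this image." \
  --media ./example.jpg
```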

 * Image modality input with chunked prefill:

```bash
python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code --enable_chunked_prefill --max_num_tokens=256
```

 * Video modality input:

```bash
python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code --modality video --max_num_tokens 131072
```

 * Video modality input with Efficient Video Sampling (EVS). `TLLM_VIDEO_PRUNING_RATIO` sets the fraction of video tokens pruned; `0.9` keeps roughly 10% of them:

```bash
TLLM_VIDEO_PRUNING_RATIO=0.9 python3 examples/llm-api/quickstart_multimodal.py --model_dir nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 --disable_kv_cache_reuse --max_batch_size 128 --trust_remote_code --modality video --max_num_tokens 131072
```

## Online serving example CMDs

 * The commands below use the BF16 checkpoint as an example; you can swap in the FP8 or NVFP4 checkpoint instead.

```bash
# Create the extra LLM API config file.
cat > ./extra-llm-api-config.yml << EOF
kv_cache_config:
  enable_block_reuse: false  # prefix caching (block reuse) is not supported yet
  mamba_ssm_cache_dtype: float32
EOF

# CMD to launch the server without EVS.
trtllm-serve \
nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 16 \
--max_num_tokens 131072 \
--trust_remote_code \
--media_io_kwargs "{\"video\": {\"fps\": 2, \"num_frames\": 128} }" \
--extra_llm_api_options extra-llm-api-config.yml

# CMD to launch the server with EVS (video_pruning_ratio=0.9).
TLLM_VIDEO_PRUNING_RATIO=0.9 trtllm-serve \
nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 16 \
--max_num_tokens 131072 \
--trust_remote_code \
--media_io_kwargs "{\"video\": {\"fps\": 2, \"num_frames\": 128} }" \
--extra_llm_api_options extra-llm-api-config.yml
```
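Once the server is up, it exposes an OpenAI-compatible HTTP API. As a minimal usage sketch (the image URL is a placeholder, and the served model name is assumed to match the checkpoint name passed to `trtllm-serve`), a chat completion request with an image might look like:

```bash
# Hypothetical sketch: send an image-plus-text chat request to the
# OpenAI-compatible /v1/chat/completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "https://example.com/example.jpg"}}
      ]
    }],
    "max_tokens": 128
  }'
```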

## Known issues
 * Avoid overly large batch sizes; otherwise the Mamba cache may run out of memory (OOM).
 * Chunked prefill is not yet supported for video modality input.
 * Prefix caching (KV cache block reuse) is not yet supported for Nemotron-nano-v2-VL, hence `--disable_kv_cache_reuse` / `enable_block_reuse: false` in the commands above.