
Commit 25bd2e6

[None][doc] Add DeepSeek-V3.2-Exp document (#9141)
Signed-off-by: Fanrong Li <23290157+lfr-0531@users.noreply.github.com>
1 parent 8bd7791 commit 25bd2e6


examples/models/core/deepseek_v3/README.md

Lines changed: 21 additions & 10 deletions
@@ -1,7 +1,8 @@
-# DeepSeek‑V3 and DeepSeek-R1
+# DeepSeek‑V3, DeepSeek-R1, and DeepSeek-V3.2-Exp
+
+This guide walks you through the examples to run the DeepSeek‑V3/DeepSeek-R1/DeepSeek-V3.2-Exp models using NVIDIA's TensorRT LLM framework with the PyTorch backend.
+**DeepSeek-R1 and DeepSeek-V3 share the same model architecture, differing only in their weights, and share the same code path in TensorRT LLM. DeepSeek-V3.2-Exp features DeepSeek Sparse Attention (DSA) but otherwise shares the same code as DeepSeek-R1 and DeepSeek-V3 in TensorRT LLM. For brevity, only one model example is provided; the example commands can be used interchangeably by simply replacing the model name.**
 
-This guide walks you through the examples to run the DeepSeek‑V3/DeepSeek-R1 models using NVIDIA's TensorRT LLM framework with the PyTorch backend.
-
-**DeepSeek-R1 and DeepSeek-V3 share exact same model architecture other than weights differences, and share same code path in TensorRT-LLM, for brevity we only provide one model example, the example command to be used interchangeably by only replacing the model name to the other one**.
 
 To benchmark the model with best configurations, refer to [DeepSeek R1 benchmarking blog](../../../../docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md).
 

@@ -14,7 +15,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 ## Table of Contents
 
 
-- [DeepSeek‑V3 and DeepSeek-R1](#deepseekv3-and-deepseek-r1)
+- [DeepSeek‑V3, DeepSeek-R1, and DeepSeek-V3.2-Exp](#deepseekv3-deepseek-r1-and-deepseekv32-exp)
 - [Table of Contents](#table-of-contents)
 - [Hardware Requirements](#hardware-requirements)
 - [Downloading the Model Weights](#downloading-the-model-weights)
@@ -56,15 +57,15 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
 ## Hardware Requirements
 
 DeepSeek-V3 has 671B parameters, which requires about 671 GB of GPU memory for the FP8 weights, plus additional memory for activation tensors and the KV cache.
-The minimum hardware requirements for running DeepSeek V3/R1 at FP8/FP4/W4A8 are listed as follows.
+The minimum number of GPUs required to run DeepSeek V3/R1/V3.2-Exp at FP8/FP4/W4A8 is listed below.
 
-| GPU | DeepSeek-V3/R1 FP8 | DeepSeek-V3/R1 FP4 | DeepSeek-V3/R1 W4A8 |
+| GPU | DeepSeek-V3/R1/V3.2-Exp FP8 | DeepSeek-V3/R1/V3.2-Exp FP4 | DeepSeek-V3/R1 W4A8 |
 | -------- | ------- | -- | -- |
 | H100 80GB | 16 | N/A | 8 |
 | H20 141GB | 8 | N/A | 4 |
 | H20 96GB | 8 | N/A | 4 |
 | H200 | 8 | N/A | 4 |
-| B200/GB200| Not supported yet, WIP | 4 (8 GPUs is recommended for best perf) | Not supported yet, WIP |
+| B200/GB200| 8 | 4 (8 GPUs recommended for best perf) | Not supported yet, WIP |
 
 Ampere architecture (SM80 & SM86) is not supported.

@@ -88,6 +89,7 @@ To quickly run DeepSeek-V3, [examples/llm-api/quickstart_advanced.py](../llm-api
 cd examples/llm-api
 python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8
 ```
+Please include `--tokens_per_block 64` when running DeepSeek-V3.2-Exp, as this model uses the `deep_gemm.fp8_paged_mqa_logits` kernel, which requires a KV cache block size of 64.
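
For example, a DeepSeek-V3.2-Exp run combining the flags above might look like the following sketch (`<YOUR_MODEL_DIR>` is a placeholder for a local DeepSeek-V3.2-Exp checkpoint):
```
# DeepSeek-V3.2-Exp requires a KV cache block size of 64
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --tokens_per_block 64
```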
 
 The model will be run by the PyTorch backend and generate outputs like:
 ```
@@ -105,7 +107,7 @@ cd examples/llm-api
 python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --spec_decode_algo MTP --spec_decode_max_draft_len N
 ```
 
-`N` is the number of MTP modules. When `N` is equal to `0`, which means that MTP is not used (default). When `N` is greater than `0`, which means that `N` MTP modules are enabled. In the current implementation, the weight of each MTP module is shared.
+`N` is the number of MTP modules. When `N` is `0` (the default), MTP is not used. When `N` is greater than `0`, `N` MTP modules are enabled. In the current implementation, the weights of the MTP modules are shared. Please include `--tokens_per_block 64` when running DeepSeek-V3.2-Exp.
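
As an illustrative sketch, an MTP run of DeepSeek-V3.2-Exp could combine these flags (the draft length of 3 here is only an example value):
```
# MTP speculative decoding with 3 draft tokens; --tokens_per_block 64 is needed for DeepSeek-V3.2-Exp
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 \
    --spec_decode_algo MTP --spec_decode_max_draft_len 3 --tokens_per_block 64
```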
 
 #### Relaxed acceptance
 **NOTE: This feature can only be used for DeepSeek R1.**
@@ -737,15 +739,15 @@ mpirun -H <HOST1>:8,<HOST2>:8 \
 ```
 
 ### FlashMLA
-TensorRT LLM has already integrated FlashMLA in the PyTorch backend. It is enabled automatically when running DeepSeek-V3/R1.
+TensorRT LLM has already integrated FlashMLA in the PyTorch backend. It is enabled automatically when running DeepSeek-V3/R1. When running DeepSeek-V3.2-Exp on Hopper, FlashMLA is the default backend for the sparse MLA.
 
 ### FP8 KV Cache and MLA
 
 FP8 KV Cache and MLA quantization can be enabled, which delivers two key performance advantages:
 - Compression of the latent KV cache enables larger batch sizes, resulting in higher throughput;
 - The MLA kernel of the generation phase is accelerated by FP8 arithmetic and reduced KV cache memory access.
 
-FP8 KV Cache and MLA is supported on Hopper and Blackwell. The accuracy loss is small, with GSM8k accuracy drop less than 1%.
+FP8 KV Cache and MLA are supported on Hopper and Blackwell for DeepSeek-V3 and DeepSeek-R1, but only on Blackwell for DeepSeek-V3.2-Exp. The accuracy loss is small, with a GPQA accuracy drop of less than 1%.
 - On Hopper we use the [FP8 FlashMLA kernel](https://github.com/deepseek-ai/FlashMLA/pull/54) from the community.
 - On Blackwell we use the kernel generated from an internal code-gen based solution called `trtllm-gen`.
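
For illustration only, and assuming your build of `quickstart_advanced.py` exposes a `--kv_cache_dtype` option (check `python quickstart_advanced.py --help` for the exact flag name), FP8 KV cache could be requested along these lines:
```
# Hypothetical flag: verify the exact option name in your version before use
python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --tp_size 8 --kv_cache_dtype fp8
```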

@@ -861,3 +863,12 @@ python quickstart_advanced.py --model_dir <YOUR_MODEL_DIR> --enable_chunked_pref
 - **GPU Memory:** Adjust `--max_batch_size` and `--max_num_tokens` if you encounter out-of-memory errors.
 - **Logs:** Check `/workspace/trt_bench.log` for detailed performance information and troubleshooting messages.
 - **Configuration Files:** Verify that the configuration files are correctly formatted to avoid runtime issues.
+
+## Known Issues
+- Support for KV Cache Reuse and Chunked Prefill in DeepSeek-V3.2-Exp is currently under development. When running `quickstart_advanced.py`, please include `--disable_kv_cache_reuse` to disable KV Cache Reuse. When using `trtllm-eval`/`trtllm-serve`/`trtllm-bench`, please include the following configuration in the extra llm_api options:
+```
+kv_cache_config:
+  enable_block_reuse: false
+  tokens_per_block: 64
+enable_chunked_prefill: false
+```
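
For reference, a common way to apply these options is to write them to a YAML file and pass that file to the CLI; the sketch below assumes the standard `--extra_llm_api_options` flag of `trtllm-serve` and uses a placeholder model path:
```
# Write the workaround options to a YAML file...
cat > extra_llm_api_options.yaml <<'EOF'
kv_cache_config:
  enable_block_reuse: false
  tokens_per_block: 64
enable_chunked_prefill: false
EOF

# ...and pass it when launching the server
trtllm-serve <YOUR_MODEL_DIR> --tp_size 8 --extra_llm_api_options extra_llm_api_options.yaml
```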
