Additionally, we offer a fully executable script; please refer to [Disaggregated SLURM Scripts](./slurm/simple_example/).
## Mixed Precision Context and Generation
In disaggregated serving, context workers and generation workers have different performance characteristics: context workers are compute-bound, while generation workers are memory-bound. It can therefore be beneficial to run the two in different precisions, for example BF16 for context workers and FP8 for generation workers.
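
The compute-bound versus memory-bound distinction can be illustrated with a rough arithmetic-intensity estimate. The sketch below is a simplified model and not a measurement: it assumes a single square weight matrix, a hypothetical hidden size of 4096, 2-byte activations, a context step that processes 4096 tokens at once, and a generation step that processes one token. The function name and all constants are chosen for illustration only.

```python
def arithmetic_intensity(num_tokens: int, hidden_size: int,
                         weight_bytes: float, act_bytes: float = 2.0) -> float:
    """FLOPs per byte moved for one (num_tokens x hidden) @ (hidden x hidden) matmul."""
    # Each output element needs hidden_size multiply-adds (2 FLOPs each).
    flops = 2 * num_tokens * hidden_size * hidden_size
    # The weight matrix is read once, regardless of how many tokens are processed.
    weight_traffic = hidden_size * hidden_size * weight_bytes
    # Activations: read the input and write the output.
    act_traffic = 2 * num_tokens * hidden_size * act_bytes
    return flops / (weight_traffic + act_traffic)


HIDDEN = 4096  # illustrative hidden size (assumption)

context_bf16 = arithmetic_intensity(4096, HIDDEN, weight_bytes=2)  # many tokens per step
generation_bf16 = arithmetic_intensity(1, HIDDEN, weight_bytes=2)  # one token per step
generation_fp8 = arithmetic_intensity(1, HIDDEN, weight_bytes=1)   # 1-byte weights

print(f"context (BF16):    {context_bf16:.1f} FLOPs/byte")
print(f"generation (BF16): {generation_bf16:.2f} FLOPs/byte")
print(f"generation (FP8):  {generation_fp8:.2f} FLOPs/byte")
```

Under these assumptions the generation step performs far fewer FLOPs per byte than the context step, so its cost is dominated by reading weights. Halving the weight size roughly halves that traffic, which is one reason a lower precision can help generation workers in particular.
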
### Prerequisites
To enable mixed precision serving, you will need:
1. A quantized checkpoint created with [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
2. The original unquantized checkpoint (this checkpoint can also be quantized)
3. The same KV cache dtype in both checkpoints, to ensure compatibility during KV cache transfer
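
The KV cache requirement in the list above can be checked before launching workers. The following is a minimal sketch, not part of TensorRT-LLM: it assumes that a quantized checkpoint directory contains an `hf_quant_config.json` file with a `quantization.kv_cache_quant_algo` entry, and that a checkpoint without this file has an unquantized KV cache. The file name, key names, and helper functions are assumptions for illustration; verify them against your own checkpoints. The check compares only the KV cache quantization setting, not the base dtype of the model.

```python
import json
from pathlib import Path
from typing import Optional

# Assumed name of the quantization metadata file inside a checkpoint directory.
QUANT_CONFIG_NAME = "hf_quant_config.json"


def load_quant_config(checkpoint_dir) -> Optional[dict]:
    """Return the parsed quantization config, or None if the file is absent."""
    path = Path(checkpoint_dir) / QUANT_CONFIG_NAME
    if not path.is_file():
        return None  # assumption: no file means an unquantized checkpoint
    with path.open() as f:
        return json.load(f)


def kv_cache_algo(quant_config: Optional[dict]) -> Optional[str]:
    """Extract the KV cache quantization algorithm, or None if unquantized."""
    if not quant_config:
        return None
    # Assumed layout: {"quantization": {"kv_cache_quant_algo": ...}}
    algo = (quant_config.get("quantization") or {}).get("kv_cache_quant_algo")
    return algo or None  # treat null or "" as "not quantized"


def kv_cache_compatible(ctx_config: Optional[dict], gen_config: Optional[dict]) -> bool:
    """True when both checkpoints declare the same KV cache quantization."""
    return kv_cache_algo(ctx_config) == kv_cache_algo(gen_config)


# Example with in-memory configs: unquantized context checkpoint,
# FP8-weight generation checkpoint whose KV cache is left unquantized.
ctx_cfg = None
gen_cfg = {"quantization": {"quant_algo": "FP8", "kv_cache_quant_algo": None}}
print(kv_cache_compatible(ctx_cfg, gen_cfg))
```

In practice, `load_quant_config` would be called on the two checkpoint directories and the results passed to `kv_cache_compatible`.
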
### Example (BF16 Ctx, FP8 Gen)
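A quantized checkpoint whose KV cache is left unquantized can be created by passing `--kv_cache_qformat none` during quantization, so that its KV cache dtype matches that of the BF16 checkpoint.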