
Commit 29431db

Merge branch 'vllm-project:main' into jha/gemma3_textembedding
2 parents 6c3be9e + df3e30e commit 29431db

26 files changed: +617 -129 lines changed

.cd/Dockerfile.rhel.tenc.pytorch.vllm

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ ARG VLLM_PROJECT_COMMIT=
 
 ARG BASE_NAME
 ENV BASE_NAME=${BASE_NAME}
+ENV HABANA_VISIBLE_DEVICES=all
 
 ENV OMPI_MCA_btl_vader_single_copy_mechanism=none
 
.cd/Dockerfile.ubuntu.pytorch.vllm

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,7 @@
 
 # Parameterize base image components
 ARG DOCKER_URL=vault.habana.ai/gaudi-docker
-ARG VERSION=1.22.0
+ARG VERSION=1.22.2
 ARG BASE_NAME=ubuntu24.04
 ARG PT_VERSION=2.7.1
 ARG REVISION=latest
@@ -25,6 +25,7 @@ WORKDIR /root
 
 ENV VLLM_PATH=/workspace/vllm-project
 ENV VLLM_PATH2=/workspace/vllm-gaudi
+ENV HABANA_VISIBLE_DEVICES=all
 
 RUN echo "dash dash/sh boolean false" | debconf-set-selections && \
 DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash

.cd/Dockerfile.ubuntu.pytorch.vllm.nixl.latest

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,7 @@
 
 # Parameterize base image components
 ARG DOCKER_URL=vault.habana.ai/gaudi-docker
-ARG VERSION=1.22.0
+ARG VERSION=1.22.2
 ARG BASE_NAME=ubuntu22.04
 ARG PT_VERSION=2.7.1
 ARG REVISION=latest
@@ -24,6 +24,7 @@ WORKDIR /root
 
 ENV VLLM_PATH=/workspace/vllm-project
 ENV VLLM_PATH2=/workspace/vllm-gaudi
+ENV HABANA_VISIBLE_DEVICES=all
 
 RUN echo "dash dash/sh boolean false" | debconf-set-selections && \
 DEBIAN_FRONTEND=noninteractive dpkg-reconfigure dash

docs/.nav.yml

Lines changed: 7 additions & 1 deletion
@@ -10,7 +10,7 @@ nav:
   - getting_started/compatibility_matrix.md
   - getting_started/validated_models.md
   - Configuration Guides:
-    - configuration/env_vars.md
+    - configuration/env_variables.md
     - configuration/long_context.md
     - Calibration:
       - configuration/calibration/calibration.md
@@ -21,7 +21,13 @@ nav:
       - configuration/quantization/inc.md
      - configuration/quantization/auto_awq.md
      - configuration/quantization/gptqmodel.md
+    - configuration/performance_tuning.md
     - configuration/pipeline_parallelism.md
+    - Warm-up:
+      - configuration/warm-up/warm-up.md
+      - configuration/warm-up/sampler_warm-up.md
+      - configuration/warm-up/defragmenter_warm-up.md
+      - configuration/warm-up/managing_warm-up.md
   - Features:
     - features/supported_features.md
     - features/bucketing_mechanism.md
docs/configuration/{env_vars.md → env_variables.md}

File renamed without changes.

docs/configuration/optimization.md

Lines changed: 0 additions & 3 deletions
This file was deleted.
docs/configuration/performance_tuning.md

Lines changed: 45 additions & 0 deletions
# Performance Tuning

Understanding how configuration settings affect system behavior is essential for effective performance management. This document explains how you can tune and optimize performance.

## Memory Allocation

HPU graphs and the KV cache share the same usable memory pool, determined by `gpu_memory_utilization`. Memory allocation between the two must be balanced to prevent performance degradation. You can find memory consumption information for your model in the logs. They provide device memory usage during model weight loading, profiling runs (using dummy data and without the KV cache), and the final usable memory available before the warm-up phase begins. You can use this information to determine an appropriate bucketing scheme for warm-ups. The following example shows the initial part of the generated server log for the [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) model:

```text hl_lines="3 4 5 7"
INFO 09-24 17:31:39 habana_model_runner.py:590] Pre-loading model weights on hpu:0 took 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.067 GiB of host memory (8.199 GiB/108.2 GiB used)
INFO 09-24 17:31:39 habana_model_runner.py:636] Wrapping in HPU Graph took 0 B of device memory (15.05 GiB/94.62 GiB used) and -3.469 MiB of host memory (8.187 GiB/108.2 GiB used)
INFO 09-24 17:31:39 habana_model_runner.py:640] Loading model weights took in total 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.056 GiB of host memory (8.188 GiB/108.2 GiB used)
INFO 09-24 17:31:40 habana_worker.py:153] Model profiling run took 355 MiB of device memory (15.4 GiB/94.62 GiB used) and 131.4 MiB of host memory (8.316 GiB/108.2 GiB used)
INFO 09-24 17:31:40 habana_worker.py:177] Free device memory: 79.22 GiB, 71.3 GiB usable (gpu_memory_utilization=0.9), 7.13 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.1), 64.17 GiB reserved for KV cache
INFO 09-24 17:31:40 habana_executor.py:85] # HPU blocks: 4107, # CPU blocks: 256
INFO 09-24 17:31:41 habana_worker.py:208] Initializing cache engine took 64.17 GiB of device memory (79.57 GiB/94.62 GiB used) and 1.015 GiB of host memory (9.329 GiB/108.2 GiB used)
```

You can control the ratio between HPU graphs and KV cache using the `VLLM_GRAPH_RESERVED_MEM` environment variable. Increasing the KV cache size enables larger batch processing, improving overall throughput. Enabling [HPU graphs](warm-up/warm-up.md#hpu-graph-capture) helps reduce host [overhead](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html#reducing-host-overhead-with-hpu-graphs) and can lower latency.
The following example shows the warm-up phase logs:

```text
INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Prompt captured:24 (100.0%) used_mem:67.72 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Decode captured:1 (100.0%) used_mem:64 KiB buckets:[(4, 128)]
INFO 09-24 17:32:13 habana_model_runner.py:1620] Warmup finished in 32 secs, allocated 92.77 MiB of device memory
INFO 09-24 17:32:13 habana_executor.py:91] init_cache_engine took 64.26 GiB of device memory (79.66 GiB/94.62 GiB used) and 1.104 GiB of host memory (9.419 GiB/108.2 GiB used)
```

After analyzing these logs, you should have a good understanding of how much free device memory remains for overhead calculations and how much more could still be used by increasing `gpu_memory_utilization`. You can balance the memory allocation for warm-up bucketing, HPU graphs, and the KV cache to suit your workload requirements.
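For example, a launch along the following lines adjusts both knobs; the values are illustrative only and should be tuned against the numbers reported in your own logs:

```bash
# Illustrative values only: reserve 20% of the usable pool for HPU graphs
# and let vLLM use 92% of device memory overall.
VLLM_GRAPH_RESERVED_MEM=0.2 \
vllm serve meta-llama/Meta-Llama-3.1-8B \
    --gpu-memory-utilization 0.92 \
    --max-model-len 8192
```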
## Bucketing Mechanism

The [bucketing mechanism](../features/bucketing_mechanism.md) can help optimize performance across different workloads. The vLLM server is pre-configured for heavy decoding scenarios with high request concurrency, using the default maximum batch size strategy (`VLLM_GRAPH_DECODE_STRATEGY`). During low-load periods, this configuration may not be ideal and can be adjusted for smaller batch sizes. For example, modifying bucket ranges via `VLLM_DECODE_BS_BUCKET_{param}` can improve efficiency. For a list of environment variables controlling bucketing behavior, see the [Environment Variables](env_variables.md) document.
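As a sketch, a low-concurrency deployment might narrow the decode batch-size buckets before starting the server; the values below are assumptions and depend on your traffic:

```bash
# Assumed values for a low-concurrency scenario; tune to your workload.
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BS_BUCKET_STEP=8
export VLLM_DECODE_BS_BUCKET_MAX=32
```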
## Floating Point 8-bit

Using the Floating Point 8-bit (FP8) data type for large language models reduces memory bandwidth requirements by half compared to BF16. In addition, FP8 computation is twice as fast as BF16, enabling performance gains even for compute-bound workloads, such as offline inference with large batch sizes. For more information, see the [Floating Point 8-bit](../features/floating_point_8.md) document.
## Warm-Up

During the development phase, when evaluating a model for inference on vLLM, you may skip the warm-up phase of the server by setting the `VLLM_SKIP_WARMUP=true` environment variable. This helps achieve faster testing turnaround times. However, disabling warm-up is acceptable only for development purposes; we strongly recommend keeping it enabled in production environments, with an optimal number of [buckets](../features/bucketing_mechanism.md).

Warm-up time depends on many factors, such as input and output sequence length, batch size, number of buckets, and data type. It can even take a couple of hours, depending on the configuration. For more information, see the [Warm-up](../features/warmup.md) document.
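For development only, where those hours are not acceptable, a quick iteration loop might look like the following; the model name is just an example:

```bash
# Development only: skips warm-up for faster turnaround; do not use in production.
VLLM_SKIP_WARMUP=true vllm serve meta-llama/Meta-Llama-3.1-8B --max-model-len 8192
```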

docs/configuration/quantization/quantization.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ title: Introduction
 
 # Quantization and Inference
 
-Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices. The Intel® Gaudi® Backend supports following quantization backends:
+Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices. The Intel® Gaudi® Backend supports the following quantization backends:
 
 - [Intel® Neural Compressor](inc.md)
 - [Auto_Awq](auto_awq.md)
docs/configuration/warm-up/defragmenter_warm-up.md

Lines changed: 54 additions & 0 deletions
# Defragmenter Warm-Up

The defragmenter reclaims and compacts sparse KV-cache block usage at runtime by swapping rarely packed high-index blocks with lower free indices. Its warm-up phase pre-compiles the small swap graphs so that later online defragmentation can execute with near-zero graph compile latency.

Defragmentation may be triggered mid-serving when the highest allocated block index drifts far above the actual number of in-use blocks (fragmentation). The operation itself is a sequence of swap kernels applied over key and value caches. With warm-up, all representative padded sizes are precompiled ahead of time via a deterministic, minimal swap. This ensures that online defragmentation becomes a predictable, low-latency maintenance task. Skipping only the defragmenter warm-up does not compromise correctness; it only increases the risk of sporadic latency when fragmentation first exceeds the threshold that mandates compaction.

The potential consequences of omitting warm-up include:

- The first fragmentation event that requires a previously unseen padded swap size triggers graph capture and compilation on the critical path.
- Compilation latency can manifest as a sudden tail-latency spike for a user request.
- Multiple first-seen swap sizes across different processes may each trigger separate compilations.

You can disable either the warm-up step itself or the entire defragmentation feature. To skip all warm-up phases, including the defragmenter, set `VLLM_SKIP_WARMUP=true`. Alternatively, running without unified attention effectively disables the defragmenter, since it is tied to unified attention; in this case, the warm-up becomes a no-op. Note that there is no separate environment flag in this version to force-enable or disable defragmentation independently of unified attention. Additionally, if supported by your execution mode, you can avoid graph compilation for defragmenter swaps by setting `VLLM_DEFRAG_WITH_GRAPHS=false`. This causes swaps to fall back to regular execution, while the warm-up still exercises them without triggering graph capture.

Related environment variables, combined in the example below:

- `VLLM_DEFRAG_THRESHOLD`: Sets the fragmentation trigger heuristic. The default value is 32; lower values make compaction more aggressive.
- `VLLM_DEFRAG_WITH_GRAPHS`: Determines whether defragmenter swaps run as compiled HPU graphs or fall back to regular execution. By default, this follows `bridge_mode == eager`.
- `VLLM_DEBUG=defrag`: Enables verbose defragmentation debug logging.
- `VLLM_SKIP_WARMUP`: Disables all warm-up stages, including defragmentation.
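For example, a debug-oriented launch might combine these variables as follows; the threshold value is an assumption and should be tuned to your workload:

```bash
# Assumed values: a more aggressive compaction threshold plus verbose defragmentation logging.
VLLM_DEFRAG_THRESHOLD=16 \
VLLM_DEBUG=defrag \
vllm serve meta-llama/Meta-Llama-3.1-8B --max-model-len 8192
```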
!!! note
    Disabling the defragmenter warm-up does not turn off defragmentation itself, unless unified attention or the feature is entirely disabled. It simply skips ahead-of-time graph preparation, which may shift the compilation cost to the first live fragmentation event.
## Performing Defragmenter Warm-Up

During the main warm-up (`warmup_model`), the system calls the internal `warmup_defragmenter` method after initializing the KV caches and defragmenter. The process consists of the following steps:

1. Confirming that the defragmenter warm-up feature is enabled, as it only runs when unified attention is enabled, and that the `cache_utils` swap utilities are ready.
2. Establishing the list of padding thresholds: `[8, 16, 32, 64, 128, 256, 512]`.
3. Choosing a minimal valid swap pair `[(1, 0)]` with two distinct block IDs. Only two real blocks are required. Internally, each swap call is padded up to the current threshold length so that a compiled graph for that exact padded size is produced.
4. Iterating through each threshold and invoking a swap. This captures or compiles the swap graph for that padded size, depending on the execution mode.
5. Performing one extra swap with the first threshold when the number of thresholds is odd, so that the sequence of swaps returns the KV cache to its original state (net zero logical change).
6. Logging completion.
Future defragmentation swap requests always round or pad to one of these known thresholds. All operational swap sizes hit a pre-compiled path and avoid on-demand compilation latency.
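To illustrate the rounding, the following shell sketch (not part of vLLM) pads a hypothetical swap-batch size up to the nearest warm-up threshold:

```bash
#!/usr/bin/env bash
# Illustration only: a swap batch is padded up to the next warm-up threshold,
# so it always matches one of the graphs compiled during warm-up.
thresholds=(8 16 32 64 128 256 512)
n=37  # hypothetical number of block swaps requested by the defragmenter
for t in "${thresholds[@]}"; do
  if (( n <= t )); then
    echo "swap list of $n padded to threshold $t"
    break
  fi
done
```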
## Logs

The following example presents a typical sequence of logs that appears when at least two KV-cache blocks are available:

```text
INFO 09-22 16:26:24 [hpu_model_runner.py:3428] Warming up defragmenter with thresholds: [8, 16, 32, 64, 128, 256, 512]
INFO 09-22 16:26:27 [hpu_model_runner.py:3452] Defragmenter warmup completed successfully
```

If insufficient blocks exist, for example in an extremely small test configuration or after an allocation failure, warm-up is skipped gracefully and you may see logs similar to the following example:

```text
INFO 09-22 16:26:24 [hpu_model_runner.py:3428] Warming up defragmenter with thresholds: [8, 16, 32, 64, 128, 256, 512]
WARNING hh:mm:ss hpu_model_runner.py:#### Skipping defragmenter warmup, insufficient blocks (1)
```

To emit fine-grained debug messages during live defragmentation, not only during the minimal warm-up swaps, add `VLLM_DEBUG=defrag` to the environment. This lets you see the number of blocks swapped and post-compaction statistics.
docs/configuration/warm-up/managing_warm-up.md

Lines changed: 79 additions & 0 deletions
# Managing and Reducing Warm-up Time

This document provides guidance on reducing warm-up time during vLLM model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies, and experimental features to improve model performance.

## Reducing Warm-up Time with HPU Graph Caching

Intel Gaudi software supports caching of compiled HPU graphs using the `PT_HPU_RECIPE_CACHE_CONFIG` environment variable. This can significantly reduce startup time by reusing previously compiled graphs.

Set the variable using the following format:

```bash
export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
```

Where:

- `RECIPE_CACHE_PATH`: The directory for storing the compiled graph recipes.
- `RECIPE_CACHE_DELETE`: A boolean that controls cache behavior: when set to `true`, existing contents are cleared before storing new graph-compiled recipes; when set to `false`, the graph-compiled recipes stored in `RECIPE_CACHE_PATH` are reused, which speeds up the warm-up.
- `RECIPE_CACHE_SIZE_MB`: Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes, based on file creation time. We recommend adjusting the cache directory size according to the model and use case requirements.
The graph compilation process consists of two stages: GC graph compilation and HPU graph compilation. When `PT_HPU_RECIPE_CACHE_CONFIG` is enabled, the GC stage is skipped by reusing cached graphs, significantly reducing overall compilation time. The HPU graph compilation step, however, is still executed. The graph has to be regenerated in the following cases:

- PyTorch container or Intel® Gaudi® software version changes.
- Platform changes, for example Intel® Gaudi® 2 to Intel® Gaudi® 3.
- Model tensor parallelism or data type changes, for example, BF16 to FP8 or FP8 to BF16.

### Storage Recommendations

For scale-up scenarios where caching is shared across processes, we recommend using the local disk. Remote filesystems, such as NFS, should be avoided because they do not support file locking.

In Kubernetes environments, the cache can be stored on a PVC or NFS, but it should be copied to local disk before use.

For a usage example, refer to [Intel Gaudi Tutorials](https://github.com/HabanaAI/Gaudi-tutorials/blob/special/k8s/vllm-8b-cache.yaml).
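A minimal sketch of the copy-to-local-disk step, assuming the recipe cache was pre-populated on a PVC mounted at `/cache-pvc` (all paths here are hypothetical):

```bash
# Hypothetical paths: copy the pre-populated recipe cache from the PVC mount to local disk,
# then point PT_HPU_RECIPE_CACHE_CONFIG at the local copy in replay mode.
mkdir -p /tmp/recipe_cache
cp -r /cache-pvc/recipe_cache/. /tmp/recipe_cache/
export PT_HPU_RECIPE_CACHE_CONFIG=/tmp/recipe_cache,False,8192
```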
### Deployment with vLLM

To cache the compiled HPU graphs and reduce the startup time, use one of the following methods.

#### Serving Command

Add the cache parameter to the serving command as shown in the following example for Llama 3.1 8B:

```bash
# Store in cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Replay from cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 PT_HPU_MAX_COMPOUND_OP_SIZE=30 PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192
```

This results in the following warm-up times:

| Precision | Without cache | With cache | Time reduction |
| --------- | ------------- | ---------- | -------------- |
| BF16      | 66 sec        | 23 sec     | ~65% faster    |
| FP8       | 504 sec       | 34 sec     | ~93% faster    |
#### Docker

No changes are required in the Dockerfile, as the recipe cache is specific to the model and use case. Use the `-e` flag to set the environment variable:

```
-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
```
## Bucket Management

vLLM warm-up time is determined by the number of HPU graphs that must be compiled to support dynamic shapes. These shapes are influenced by the `batch_size`, `query_length`, and `num_context_blocks`. Setting them according to `max_num_batched_tokens` ensures that additional graphs are not compiled at runtime.
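A hedged sketch of capping the bucket space alongside the scheduler limits; the values are assumptions, not a derived formula:

```bash
# Assumed values: bucket upper bounds chosen to cover the largest shapes allowed by
# the scheduler limits below, so no additional graphs are compiled at runtime.
VLLM_PROMPT_BS_BUCKET_MAX=32 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=2048 \
VLLM_DECODE_BLOCK_BUCKET_MAX=2048 \
vllm serve meta-llama/Llama-3.1-8B-instruct --max-num-batched-tokens 2048 --max-model-len 2048
```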
## Exponential Bucketing

The `VLLM_EXPONENTIAL_BUCKETING=True` flag, enabled by default starting with the vLLM `1.21.0-post1` release, switches the bucketing strategy from linear to exponential. This can reduce the number of buckets and warm-up time by up to 80%, while generally maintaining comparable inference performance. In some configurations, however, it may lead to a slight performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models. To use linear bucketing instead, set `VLLM_EXPONENTIAL_BUCKETING=False`.
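For example, to fall back to linear bucketing when padding overhead matters more than warm-up time (the model name is just an example):

```bash
# Linear bucketing: more buckets and a longer warm-up, but potentially less padding.
VLLM_EXPONENTIAL_BUCKETING=False vllm serve meta-llama/Llama-3.1-8B-instruct --max-model-len 8192
```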

0 commit comments
