Commit 0ff7082

[Core] Deprecate xformers (#29262)
Signed-off-by: Roger Wang <hey@rogerw.io>
1 parent 5253f42 commit 0ff7082

File tree

31 files changed: +77 additions, -963 deletions

docker/Dockerfile.nightly_torch

Lines changed: 1 addition & 34 deletions
@@ -76,34 +76,6 @@ RUN --mount=type=cache,target=/root/.cache/uv \
 RUN --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system -r requirements/common.txt
 
-# must put before installing xformers, so it can install the correct version of xfomrers.
-ARG torch_cuda_arch_list='8.0;8.6;8.9;9.0'
-ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
-
-# Build xformers with cuda and torch nightly
-# following official xformers guidance: https://github.com/facebookresearch/xformers#build
-# todo(elainewy): cache xformers build result for faster build
-ARG max_jobs=16
-ENV MAX_JOBS=${max_jobs}
-ARG XFORMERS_COMMIT=f2de641ef670510cadab099ce6954031f52f191c
-
-ENV CCACHE_DIR=/root/.cache/ccache
-RUN --mount=type=cache,target=/root/.cache/ccache \
-    --mount=type=cache,target=/root/.cache/uv \
-    echo 'git clone xformers...' \
-    && git clone https://github.com/facebookresearch/xformers.git --recursive \
-    && cd xformers \
-    && git checkout ${XFORMERS_COMMIT} \
-    && git submodule update --init --recursive \
-    && echo 'finish git clone xformers...' \
-    && rm -rf build \
-    && python3 setup.py bdist_wheel --dist-dir=../xformers-dist --verbose \
-    && cd .. \
-    && rm -rf xformers
-
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system xformers-dist/*.whl --verbose
-
 # build can take a long time, and the torch nightly version fetched from url can be different in next docker stage.
 # track the nightly torch version used in the build, when we set up runtime environment we can make sure the version is the same
 RUN uv pip freeze | grep -i '^torch\|^torchvision\|^torchaudio' > torch_build_versions.txt
@@ -233,11 +205,6 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/vllm
     --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system vllm-dist/*.whl --verbose
 
-# install xformers again for the new environment
-RUN --mount=type=bind,from=base,src=/workspace/xformers-dist,target=/vllm-workspace/xformers-dist \
-    --mount=type=cache,target=/root/.cache/uv \
-    uv pip install --system /vllm-workspace/xformers-dist/*.whl --verbose
-
 ARG torch_cuda_arch_list='8.0;8.6;8.9;9.0'
 
 # install package for build flashinfer
@@ -307,7 +274,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
     uv pip install --system -r requirements/nightly_torch_test.txt
 
 # Logging to confirm the torch versions
-RUN pip freeze | grep -E 'torch|xformers|vllm|flashinfer'
+RUN pip freeze | grep -E 'torch|vllm|flashinfer'
 
 # Logging to confirm all the packages are installed
 RUN pip freeze

docs/contributing/ci/update_pytorch_version.md

Lines changed: 0 additions & 15 deletions
@@ -98,21 +98,6 @@ to warm it up so that future builds are faster.
 <img width="60%" alt="Buildkite new build popup" src="https://github.com/user-attachments/assets/a8ff0fcd-76e0-4e91-b72f-014e3fdb6b94">
 </p>
 
-## Update dependencies
-
-Several vLLM dependencies like xFormers depend on PyTorch and need
-to be updated accordingly. Rather than waiting for all of them to publish new
-releases (which would take too much time), they can be built from
-source to unblock the update process.
-
-### xFormers
-
-```bash
-export TORCH_CUDA_ARCH_LIST='7.5 8.0+PTX 9.0a'
-MAX_JOBS=16 uv pip install --system \
-  --no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.32.post2"
-```
-
 ## Update all the different vLLM platforms
 
 Rather than attempting to update all vLLM platforms in a single pull request, it's more manageable

docs/getting_started/quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -283,7 +283,7 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
 
 If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options:
 
-- On NVIDIA CUDA: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
+- On NVIDIA CUDA: `FLASH_ATTN` or `FLASHINFER`.
 - On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.
 
 For AMD ROCm, you can further control the specific Attention implementation using the following variables:
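For reference, a minimal sketch (not part of this commit) of selecting one of the remaining CUDA backends via the environment variable; the model name is only an illustrative placeholder:

```python
import os

# Pick the backend before vLLM selects its attention implementation;
# with xformers deprecated, FLASH_ATTN and FLASHINFER are the documented CUDA options.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```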

examples/online_serving/openai_embedding_long_text/service.sh

Lines changed: 0 additions & 1 deletion
@@ -22,7 +22,6 @@ API_KEY=${API_KEY:-"your-api-key"}
 POOLING_TYPE=${POOLING_TYPE:-"auto"} # auto, MEAN, CLS, LAST
 export VLLM_ENABLE_CHUNKED_PROCESSING=true
 export CUDA_VISIBLE_DEVICES=2,3,4,5
-# export VLLM_ATTENTION_BACKEND=XFORMERS
 
 echo "🚀 Starting vLLM Embedding Server with Enhanced Chunked Processing"
 echo "=================================================================="

requirements/cuda.txt

Lines changed: 0 additions & 1 deletion
@@ -9,6 +9,5 @@ torch==2.9.0
 torchaudio==2.9.0
 # These must be updated alongside torch
 torchvision==0.24.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
-xformers==0.0.33.post1; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch >= 2.9
 # FlashInfer should be updated together with the Dockerfile
 flashinfer-python==0.5.2

tests/basic_correctness/test_basic_correctness.py

Lines changed: 0 additions & 3 deletions
@@ -74,9 +74,6 @@ def test_models(
     model_executor: str,
     enable_prompt_embeds: bool,
 ) -> None:
-    if backend == "XFORMERS" and model == "google/gemma-2-2b-it":
-        pytest.skip(f"{backend} does not support gemma2 with full context length.")
-
     with monkeypatch.context() as m:
         m.setenv("VLLM_ATTENTION_BACKEND", backend)
 
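As an aside, a minimal sketch (not taken from the vLLM test suite) of the pattern the remaining tests rely on: pinning the attention backend per test through `monkeypatch` rather than special-casing `XFORMERS`:

```python
import os

import pytest


@pytest.mark.parametrize("backend", ["FLASH_ATTN", "FLASHINFER"])
def test_backend_env(monkeypatch: pytest.MonkeyPatch, backend: str) -> None:
    # Scope the environment change to this test only.
    with monkeypatch.context() as m:
        m.setenv("VLLM_ATTENTION_BACKEND", backend)
        # ... build the engine and run inference here ...
        assert os.environ["VLLM_ATTENTION_BACKEND"] == backend
```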

tests/kernels/attention/test_attention.py

Lines changed: 0 additions & 129 deletions
@@ -13,12 +13,6 @@
 from vllm.platforms import current_platform
 from vllm.utils.mem_utils import get_max_shared_memory_bytes
 
-if not current_platform.is_rocm():
-    from xformers import ops as xops
-    from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask
-
-    from tests.kernels.utils import make_alibi_bias
-
 FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
 # This will change depending on the compute capability.
 # - 512 as a buffer
@@ -448,129 +442,6 @@ def ref_multi_query_kv_attention(
     return torch.cat(ref_outputs, dim=0)
 
 
-@pytest.mark.parametrize("num_seqs", NUM_PREFILL_SEQS)
-@pytest.mark.parametrize("num_heads", NUM_HEADS)
-@pytest.mark.parametrize("head_size", HEAD_SIZES)
-@pytest.mark.parametrize("dtype", DTYPES)
-@pytest.mark.parametrize("seed", SEEDS)
-@pytest.mark.parametrize("device", CUDA_DEVICES)
-@pytest.mark.skipif(
-    current_platform.is_rocm(), reason="Xformers backend is not supported on ROCm."
-)
-@torch.inference_mode()
-def test_multi_query_kv_attention(
-    num_seqs: int,
-    num_heads: tuple[int, int],
-    head_size: int,
-    dtype: torch.dtype,
-    seed: int,
-    device: str,
-    use_alibi: bool = False,
-) -> None:
-    current_platform.seed_everything(seed)
-    torch.set_default_device(device)
-    # MAX_SEQ_LEN sometimes causes OOM in the reference implementation.
-    # As the xformers library is already tested with its own tests, we can use
-    # a smaller MAX_SEQ_LEN here.
-    max_len = min(MAX_SEQ_LEN, 4096)
-    seq_lens = random.sample(range(1, max_len), num_seqs)
-    num_tokens = sum(seq_lens)
-
-    scale = float(1.0 / (head_size**0.5))
-    num_query_heads, num_kv_heads = num_heads
-    qkv = torch.empty(
-        num_tokens, num_query_heads + 2 * num_kv_heads, head_size, dtype=dtype
-    )
-    qkv.uniform_(-scale, scale)
-    query, key, value = qkv.split([num_query_heads, num_kv_heads, num_kv_heads], dim=1)
-
-    num_queries_per_kv = num_query_heads // num_kv_heads
-    if num_queries_per_kv > 1:
-        # Handle MQA and GQA
-        key = torch.repeat_interleave(key, num_queries_per_kv, dim=1)
-        value = torch.repeat_interleave(value, num_queries_per_kv, dim=1)
-    alibi_bias = None
-    if use_alibi:
-        alibi_slopes = torch.randn(num_query_heads, dtype=torch.float)
-        attn_bias = make_alibi_bias(alibi_slopes, num_kv_heads, dtype, seq_lens)
-        output = torch.empty_like(query)
-        start = 0
-        # Dynamic sequence length not supported with custom attn_bias.
-        for i, seq_len in enumerate(seq_lens):
-            end = start + seq_len
-            out = xops.memory_efficient_attention_forward(
-                query[None, start:end],
-                key[None, start:end],
-                value[None, start:end],
-                attn_bias=attn_bias[i],
-                p=0.0,
-                scale=scale,
-            )
-            output[start:end].copy_(out.view_as(query[start:end]))
-            start += seq_len
-        # xformers.AttentionBias to Tensor for use in reference impl.
-        alibi_bias = [
-            b.materialize((1, num_query_heads, i, i), device=device).squeeze()
-            for b, i in zip(attn_bias, seq_lens)
-        ]
-    else:
-        attn_bias = BlockDiagonalCausalMask.from_seqlens(seq_lens)
-        output = xops.memory_efficient_attention_forward(
-            query.unsqueeze(0),
-            key.unsqueeze(0),
-            value.unsqueeze(0),
-            attn_bias=attn_bias,
-            p=0.0,
-            scale=scale,
-        )
-        output = output.squeeze(0)
-
-    cu_seq_lens = [0]
-    for seq_len in seq_lens:
-        cu_seq_lens.append(cu_seq_lens[-1] + seq_len)
-    ref_output = ref_multi_query_kv_attention(
-        cu_seq_lens,
-        query,
-        key,
-        value,
-        scale,
-        alibi_bias,
-        dtype,
-    )
-    atol = get_default_atol(output) if current_platform.is_rocm() else 1e-3
-    rtol = get_default_rtol(output) if current_platform.is_rocm() else 1e-5
-    torch.testing.assert_close(output, ref_output, atol=atol, rtol=rtol)
-
-
-@pytest.mark.parametrize("num_seqs", NUM_PREFILL_SEQS)
-@pytest.mark.parametrize("num_heads", NUM_HEADS)
-@pytest.mark.parametrize("head_size", [64])
-@pytest.mark.parametrize("dtype", DTYPES)
-@pytest.mark.parametrize("seed", SEEDS)
-@pytest.mark.parametrize("device", CUDA_DEVICES)
-@pytest.mark.skipif(
-    current_platform.is_rocm(), reason="Xformers backend is not supported on ROCm."
-)
-@torch.inference_mode()
-def test_multi_query_kv_attention_with_alibi(
-    num_seqs: int,
-    num_heads: tuple[int, int],
-    head_size: int,
-    dtype: torch.dtype,
-    seed: int,
-    device: str,
-) -> None:
-    return test_multi_query_kv_attention(
-        num_seqs,
-        num_heads,
-        head_size,
-        dtype,
-        seed,
-        device,
-        use_alibi=True,
-    )
-
-
 @pytest.mark.parametrize("attention_cls", [Attention, MultiHeadAttention])
 def test_num_heads_not_divisble_by_num_kv_heads(attention_cls: type) -> None:
     head_size = 64

tests/kernels/attention/test_attention_selector.py

Lines changed: 1 addition & 7 deletions
@@ -34,7 +34,7 @@ def clear_cache():
 }
 
 DEVICE_REGULAR_ATTN_BACKENDS = {
-    "cuda": ["XFORMERS", "FLASHINFER", "FLASH_ATTN"],
+    "cuda": ["FLASHINFER", "FLASH_ATTN"],
     "hip": ["ROCM_ATTN"],
     "cpu": ["CPU_ATTN"],
 }
@@ -207,12 +207,6 @@ def test_env(
            )
            expected = "FLASHINFER"
            assert backend.get_name() == expected
-        elif name == "XFORMERS":
-            backend = get_attn_backend(
-                32, torch.float16, None, block_size, use_mla=use_mla
-            )
-            expected = "XFORMERS"
-            assert backend.get_name() == expected
         elif name == "FLASH_ATTN":
             backend = get_attn_backend(
                 32, torch.float16, None, block_size, use_mla=use_mla

tests/kernels/attention/test_mha_attn.py

Lines changed: 0 additions & 4 deletions
@@ -24,10 +24,6 @@
 def clear_cache():
     """Clear lru cache to ensure each test case runs without caching."""
     _cached_get_attn_backend.cache_clear()
-    # Clear xformers availability cache
-    import vllm.attention.layer as layer_module
-
-    layer_module.USE_XFORMERS_OPS = None
 
 
 @pytest.mark.parametrize("device", ["cpu", "hip", "cuda"])

tests/kernels/utils.py

Lines changed: 11 additions & 67 deletions
@@ -509,43 +509,6 @@ def pack_qkv(qkv: QKVInputs, device: torch.device | str) -> PackedQKVInputs:
     )
 
 
-def make_alibi_bias(
-    alibi_slopes: torch.Tensor,
-    num_kv_heads: int,
-    dtype: torch.dtype,
-    seq_lens: list[int],
-) -> list[Any]:
-    """Create ALiBi biases compatible with xFormers attention tests."""
-    from xformers.ops.fmha.attn_bias import LowerTriangularMaskWithTensorBias
-
-    if alibi_slopes is None:
-        return [None for _ in seq_lens]
-
-    attn_biases: list[Any] = []
-    num_heads = alibi_slopes.shape[0]
-    assert num_heads >= num_kv_heads, (
-        "ALiBi slopes expect at least as many heads as KV heads"
-    )
-
-    for seq_len in seq_lens:
-        bias = torch.arange(seq_len, dtype=dtype, device=alibi_slopes.device)
-        bias = bias[None, :] - bias[:, None]
-
-        padded_len = (seq_len + 7) // 8 * 8
-        bias_tensor = torch.empty(
-            1,
-            num_heads,
-            seq_len,
-            padded_len,
-            device=alibi_slopes.device,
-            dtype=dtype,
-        )[:, :, :, :seq_len].copy_(bias)
-        bias_tensor.mul_(alibi_slopes[:, None, None])
-        attn_biases.append(LowerTriangularMaskWithTensorBias(bias_tensor))
-
-    return attn_biases
-
-
 def _make_metadata_tensors(
     seq_lens: list[int] | None,
     context_lens: list[int] | None,
@@ -649,23 +612,12 @@ def make_kv_cache(
 
     Returns:
 
-    * kv_cache: 2 x num_blocks x (block_size * num_heads * head_size)
-        * for backend 'XFORMERS'
     * kv_cache: 2 x num_blocks x block_size x num_heads x head_size
         * for backend 'FLASH_ATTN'
     """
-    if backend == "XFORMERS":
-        kv_cache = torch.rand((2, num_blocks, block_size * num_heads * head_size)).to(
-            device
-        )
-    elif backend == "FLASH_ATTN":
-        kv_cache = torch.rand((2, num_blocks, block_size, num_heads, head_size)).to(
-            device
-        )
-    else:
-        raise ValueError(
-            f"Unknown backend value: '{backend}'. Expected 'XFORMERS' or 'FLASH_ATTN'."
-        )
+    if backend != "FLASH_ATTN":
+        raise ValueError(f"Unknown backend value: '{backend}'. Expected 'FLASH_ATTN'.")
+    kv_cache = torch.rand((2, num_blocks, block_size, num_heads, head_size)).to(device)
     if default_val is not None:
         kv_cache[:, :, :] = default_val
     return kv_cache
@@ -843,22 +795,14 @@ def assert_actual_matches_ideal(
     * output_under_test: actually observed output value
     """
     ideal_output = test_params.packed_qkvo.ideal_output
-    if backend == "XFORMERS":
-        torch.testing.assert_close(
-            ideal_output, output_under_test.view_as(ideal_output)
-        )
-
-    elif backend == "FLASH_ATTN":
-        # For FlashAttention override the accuracy thresholds to non default
-        # values since we notice a higher difference between the ideal and
-        # actual output.
-        torch.testing.assert_close(
-            ideal_output, output_under_test.view_as(ideal_output), atol=0.01, rtol=0.016
-        )
-    else:
-        raise ValueError(
-            f"Unknown backend value: '{backend}'. Expected 'XFORMERS' or 'FLASH_ATTN'."
-        )
+    if backend != "FLASH_ATTN":
+        raise ValueError(f"Unknown backend value: '{backend}'. Expected 'FLASH_ATTN'.")
+    # For FlashAttention override the accuracy thresholds to non default
+    # values since we notice a higher difference between the ideal and
+    # actual output.
+    torch.testing.assert_close(
+        ideal_output, output_under_test.view_as(ideal_output), atol=0.01, rtol=0.016
+    )
 
 
 # Copied/modified from torch._refs.__init__.py
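For context, a minimal sketch (with made-up sizes, not code from this repository) of the single KV-cache layout and the relaxed FlashAttention tolerances that the helpers above now assume:

```python
import torch

# FLASH_ATTN layout kept by make_kv_cache:
# 2 x num_blocks x block_size x num_heads x head_size
num_blocks, block_size, num_heads, head_size = 8, 16, 4, 64
kv_cache = torch.rand((2, num_blocks, block_size, num_heads, head_size))
assert kv_cache.shape == (2, 8, 16, 4, 64)

# The looser thresholds used when comparing FlashAttention output to the ideal output.
ideal = torch.randn(128, num_heads, head_size)
actual = ideal + 1e-3 * torch.randn_like(ideal)  # simulated small numerical difference
torch.testing.assert_close(actual, ideal, atol=0.01, rtol=0.016)
```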
