Merged
Changes from all commits
2 changes: 1 addition & 1 deletion .github/workflows/nightly-release.yml
@@ -145,7 +145,7 @@ jobs:
- name: Build wheel in container
env:
DOCKER_IMAGE: ${{ matrix.arch == 'aarch64' && format('pytorch/manylinuxaarch64-builder:cuda{0}', matrix.cuda) || format('pytorch/manylinux2_28-builder:cuda{0}', matrix.cuda) }}
FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda == '12.8' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 12.0a' }}
FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0f 12.0f' }}
FLASHINFER_DEV_RELEASE_SUFFIX: ${{ needs.setup.outputs.dev_suffix }}
run: |
# Extract CUDA major and minor versions
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -182,7 +182,7 @@ jobs:
- name: Build wheel in container
env:
DOCKER_IMAGE: ${{ matrix.arch == 'aarch64' && format('pytorch/manylinuxaarch64-builder:cuda{0}', matrix.cuda) || format('pytorch/manylinux2_28-builder:cuda{0}', matrix.cuda) }}
FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda == '12.8' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 12.0a' }}
FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0f 12.0f' }}
run: |
# Extract CUDA major and minor versions
CUDA_MAJOR=$(echo "${{ matrix.cuda }}" | cut -d'.' -f1)
2 changes: 1 addition & 1 deletion README.md
@@ -90,7 +90,7 @@ python -m pip install dist/*.whl

`flashinfer-jit-cache` (customize `FLASHINFER_CUDA_ARCH_LIST` for your target GPUs):
```bash
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 10.0a 10.3a 12.0a"
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 10.0a 10.3a 11.0f 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
2 changes: 1 addition & 1 deletion csrc/xqa/mha.cu
@@ -93,7 +93,7 @@ __constant__ constexpr uint32_t cacheVTileSeqLen = 32;
constexpr uint32_t preferedKHeadPartBytes = 64;
__constant__ constexpr uint32_t cacheVTileSeqLen = 32;
#elif __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 870 || __CUDA_ARCH__ == 900 || \
__CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1030
__CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1030 || __CUDA_ARCH__ == 1100
Comment on lines 95 to +96

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify architecture 1100 grouping and cross-file consistency.

Architecture 1100 (compute capability 11.0, Blackwell/Thor) is being added to the group with preferedKHeadPartBytes = 128 and cacheVTileSeqLen = 64, which assumes hardware characteristics similar to the existing members of that group (architectures 800, 870, 900, 1000, 1030). Please verify:

  1. Shared memory capacity: Does architecture 1100 support the larger shared memory requirements implied by this grouping?
  2. Cross-file consistency: A past review flagged that architecture 1010 was grouped inconsistently between mha.cu and utils.cuh. Ensure 1100 is grouped consistently across all related files.
  3. Hardware characteristics: Confirm that architecture 1100's tensor core configuration and warp scheduling match this group rather than the other group (860, 890, 1200, 1210).

Run the following script to verify consistent architecture grouping across the codebase:

Also search for official NVIDIA documentation on compute capability 11.0 specifications:


🏁 Script executed:

#!/bin/bash
# Search for all architecture-specific conditionals involving 1100 or similar constants
# (preferedKHeadPartBytes, kMAX_SMEM_SIZE, cacheVTileSeqLen)

echo "=== Checking architecture 1100 grouping across files ==="
rg -n -C3 '__CUDA_ARCH__.*1100' csrc/

echo -e "\n=== Checking preferedKHeadPartBytes definitions ==="
rg -n -B5 -A2 'preferedKHeadPartBytes.*=' csrc/xqa/

echo -e "\n=== Checking cacheVTileSeqLen definitions ==="
rg -n -B5 -A2 'cacheVTileSeqLen.*=' csrc/xqa/

echo -e "\n=== Checking kMAX_SMEM_SIZE usage ==="
rg -n -B5 -A2 'kMAX_SMEM_SIZE' csrc/xqa/

Length of output: 5865


🌐 Web query:

CUDA compute capability 11.0 sm_110 shared memory specifications

💡 Result:

Short answer: PTX/CUDA added the sm_110 / compute capability 11.0 target, but NVIDIA has not published a fixed per-SM / per-block shared-memory size for compute capability 11.0 in a public compute-capability table; shared-memory limits are documented per device (and, for published compute capabilities, in Appendix H of the CUDA Programming Guide). The PTX docs confirm sm_110 support. For concrete shared-memory limits on a given GPU, query the device at runtime (cudaGetDeviceProperties / deviceQuery) or check the device's entry in the CUDA Programming Guide / CUDA Toolkit docs. See the PTX/ISA release notes (sm_110 added) [1] and the CUDA C++ Programming Guide, Appendix H (Compute Capabilities), which lists shared-memory limits, notes that they are compute-capability and device dependent, and recommends querying device properties [2].

Sources:
[1] PTX / Parallel Thread Execution docs — notes adding sm_110.
[2] NVIDIA CUDA C++ Programming Guide — Appendix H (Compute Capabilities) (see shared memory / per‑SM and per‑block limits and recommendation to query device properties).
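
As the result above suggests, the most reliable way to confirm the shared-memory budget of an sm_110 device is to query it at runtime. A minimal host-side sketch (standalone illustration, not part of this PR; device index 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // Report the actual shared-memory limits of the installed GPU instead of
  // relying on a published compute-capability table.
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, /*device=*/0);
  std::printf("CC %d.%d: %zu KB per SM, %zu KB per block (opt-in)\n",
              prop.major, prop.minor,
              prop.sharedMemPerMultiprocessor >> 10,
              prop.sharedMemPerBlockOptin >> 10);
  return 0;
}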


Pre-existing architecture grouping mismatch between utils.cuh and mha.cu must be corrected.

The addition of architecture 1100 to mha.cu line 96 exposes a critical inconsistency:

In utils.cuh (lines 47-51), shared memory is grouped as:

  • Architectures 800, 870: kMAX_SMEM_SIZE = 163 KB
  • Architectures 900, 1000, 1030, 1100: kMAX_SMEM_SIZE = 227 KB

In mha.cu (lines 92-98), parameters are grouped as:

  • Architectures 800, 870, 900, 1000, 1030, 1100: preferedKHeadPartBytes = 128, cacheVTileSeqLen = 64

Architectures 800 and 870 are incorrectly grouped with the higher-shared-memory tier. Since preferedKHeadPartBytes and cacheVTileSeqLen are derived from available shared memory, mixing architectures with 163 KB and 227 KB limits will cause the code to exceed shared memory on 800/870 devices.

Architecture 1100 itself is correctly placed with 900/1000/1030 (all 227 KB tier), but this pre-existing bug must be fixed by separating 800/870 into their own conditional block.

🤖 Prompt for AI Agents
In csrc/xqa/mha.cu around lines 95-96, the architecture conditional incorrectly
groups 800 and 870 with the higher 227KB shared-memory tier
(900/1000/1030/1100); split the conditional so architectures 800 and 870 are in
their own branch (matching utils.cuh's 163KB tier) and keep 900,1000,1030,1100
together in the higher tier; adjust the block boundaries so
preferedKHeadPartBytes and cacheVTileSeqLen are set appropriately per tier
(lower values for 800/870, unchanged for 900/1000/1030/1100) to avoid exceeding
shared memory on 800/870 devices.
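
For illustration, a minimal sketch of the split described above; the 800/870 values (64 / 32) are placeholders chosen to stay within the 163 KB tier and would still need tuning, they are not values validated in this PR:

// Hypothetical restructuring of the conditional in csrc/xqa/mha.cu.
#elif __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 870
// 163 KB shared-memory tier (matches kMAX_SMEM_SIZE in utils.cuh)
constexpr uint32_t preferedKHeadPartBytes = 64;         // placeholder, needs tuning
__constant__ constexpr uint32_t cacheVTileSeqLen = 32;  // placeholder, needs tuning
#elif __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1030 || \
    __CUDA_ARCH__ == 1100
// 227 KB shared-memory tier: keep the existing values
constexpr uint32_t preferedKHeadPartBytes = 128;
__constant__ constexpr uint32_t cacheVTileSeqLen = 64;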

constexpr uint32_t preferedKHeadPartBytes = 128;
__constant__ constexpr uint32_t cacheVTileSeqLen = 64;
#else
3 changes: 2 additions & 1 deletion csrc/xqa/utils.cuh
@@ -46,7 +46,8 @@ __constant__ constexpr float kE4M3_MAX = 448.F;
constexpr uint32_t kMAX_SMEM_SIZE = (99u << 10);
#elif __CUDA_ARCH__ == 800 || __CUDA_ARCH__ == 870
constexpr uint32_t kMAX_SMEM_SIZE = (163u << 10);
#elif __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1030
#elif __CUDA_ARCH__ == 900 || __CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1030 || \
__CUDA_ARCH__ == 1100
constexpr uint32_t kMAX_SMEM_SIZE = (227u << 10);
#endif
#endif
2 changes: 1 addition & 1 deletion docs/installation.rst
@@ -92,7 +92,7 @@ You can follow the steps below to install FlashInfer from source code:

.. code-block:: bash

export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 10.0a 10.3a 12.0a"
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 10.0a 10.3a 11.0f 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
11 changes: 10 additions & 1 deletion scripts/task_test_jit_cache_package_build_import.sh
@@ -43,7 +43,16 @@ arches = ["7.5", "8.0", "8.9", "9.0a"]
if cuda_ver is not None:
try:
major, minor = map(int, cuda_ver.split(".")[:2])
if (major, minor) >= (12, 8):
if (major, minor) >= (13, 0):
arches.append("10.0a")
arches.append("10.3a")
arches.append("11.0f")
arches.append("12.0f")
elif (major, minor) >= (12, 9):
arches.append("10.0a")
arches.append("10.3a")
arches.append("12.0f")
elif (major, minor) >= (12, 8):
arches.append("10.0a")
arches.append("12.0a")
except Exception: