[NVIDIA] Thor & Spark Support #2028
Merged
20 commits
705d15a  Thor & Spark Support (johnnynunez)
5eea497  Merge branch 'flashinfer-ai:main' into main (johnnynunez)
16ed2c3  Merge branch 'flashinfer-ai:main' into main (johnnynunez)
4958688  fix xqa compilation (yzh119)
9e5a259  Merge branch 'main' of https://github.com/johnnynunez/flashinfer into… (yzh119)
6f35f3a  Thor & Spark Support (johnnynunez)
3b6d0f6  Update release.yml (johnnynunez)
0769cf4  Update task_test_jit_cache_package_build_import.sh (johnnynunez)
b030292  revert (johnnynunez)
ab28a10  Merge remote-tracking branch 'origin/main' (johnnynunez)
563d7fd  revert (johnnynunez)
dd428d8  Update task_test_jit_cache_package_build_import.sh (johnnynunez)
77108b5  Merge branch 'main' of https://github.com/johnnynunez/flashinfer into… (yzh119)
43a0d3a  fix xqa thor compilation (yzh119)
afeb8dd  thor is only in cuda 13 (johnnynunez)
a4609fa  Merge remote-tracking branch 'origin/main' (johnnynunez)
cc1d2cc  thor's __CUDA_ARCH__ is 1010 instead of 1100 (yzh119)
e3836d2  Merge branch 'main' of https://github.com/johnnynunez/flashinfer into… (yzh119)
ad31661  Merge branch 'flashinfer-ai:main' into main (johnnynunez)
bdf2ed8  revert 1010 to 1100 (yzh119)
🧩 Analysis chain
Verify architecture 1100 grouping and cross-file consistency.

Architecture 1100 (compute capability 11.0, Blackwell/Thor) is being added to the group with preferedKHeadPartBytes = 128 and cacheVTileSeqLen = 64. This grouping assumes hardware characteristics similar to the Ampere/Hopper tier (architectures 800, 870, 900, 1000, 1030). Please verify that 1100 is grouped consistently across the related files, mha.cu and utils.cuh, and consult official NVIDIA documentation for the compute capability 11.0 specifications.
💡 Result:
Short answer: PTX/CUDA added the sm_110 target (compute capability 11.0), but NVIDIA has not published a fixed per-SM / per-block shared-memory size for compute capability 11.0 in a public compute-capability table; shared-memory limits are given per device (and, for published compute capabilities, in the CUDA Programming Guide, Appendix H). The PTX ISA release notes confirm sm_110 support. For concrete shared-memory limits on a given GPU, query the device at runtime (cudaGetDeviceProperties / deviceQuery) or check the device's entry in the CUDA Programming Guide / CUDA Toolkit docs.

Sources:
[1] PTX / Parallel Thread Execution ISA release notes (sm_110 added).
[2] NVIDIA CUDA C++ Programming Guide, Appendix H (Compute Capabilities): per-SM and per-block shared-memory limits, with the recommendation to query device properties at runtime.
Pre-existing architecture grouping mismatch between utils.cuh and mha.cu must be corrected.

The addition of architecture 1100 to mha.cu (line 96) exposes a critical inconsistency:

In utils.cuh (lines 47-51), shared memory is tiered as:
- kMAX_SMEM_SIZE = 163 KB (architectures 800, 870)
- kMAX_SMEM_SIZE = 227 KB (architectures 900, 1000, 1030)

In mha.cu (lines 92-98), parameters are grouped as:
- preferedKHeadPartBytes = 128, cacheVTileSeqLen = 64 (one group spanning both tiers, now including 1100)

Architectures 800 and 870 are therefore incorrectly grouped with the higher-shared-memory tier. Since preferedKHeadPartBytes and cacheVTileSeqLen are derived from available shared memory, mixing architectures with 163 KB and 227 KB limits will cause the code to exceed shared memory on 800/870 devices. Architecture 1100 itself is correctly placed with 900/1000/1030 (all in the 227 KB tier), but this pre-existing bug must be fixed by separating 800/870 into their own conditional block.