[AMDGPU] Search TheRock-based device libraries #739

a4lg · 2025-12-03T01:59:50Z

TheRock (the latest ROCm distribution) has slightly different path (relative to ${ROCM_PATH}) for device libraries.
This commit adds search path for device libraries on a TheRock-based distribution.

Background / Reason of this PR

With TheRock-based distribution, we can install local ROCm SDK as Python packages and that would be helpful a lot.

To fully utilize TheRock-based toolchain, we normally set the environment variable ${ROCM_PATH} to $(rocm-sdk --path root) to make sure that external programs/libraries can locate TheRock-based ROCm SDK.

One of the problems is: some programs (such as vLLM) don't use hipcc but instead, attempt to use $(rocm-sdk path --root)/lib/llvm/bin/clang++ directly. Second, TheRock has slightly different toolchain layout than the release versions of ROCm SDK.

ROCm SDK	Device libraries path
Production Release (7.1)	`${ROCM_PATH}/amdgcn/bitcode`
TheRock (7.9 / Nightly)	`${ROCM_PATH}/lib/llvm/amdgcn/bitcode`

Clang does not search the latter by default, causing a compiler error and fails to configure the ROCm toolchain on such Clang-dependent programs.

The workaround below is working yet inconvenient so I hope this issue is resolved in either:

TheRock distribution (separate PR required): Change the path of device libraries or make links/copies to the same path as the production releases.
ROCm LLVM (this PR): Add TheRock search path

Workaround (without this PR)

With the current state, we can avoid device library issues by:

export HIP_DEVICE_LIB_PATH=$(rocm-sdk path --root)/lib/llvm/amdgcn/bitcode

Why PR to ROCm repository and not upstream?

Because TheRock is in the preview state.

github-actions · 2025-12-03T02:00:18Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

z1-cciauto · 2025-12-03T02:02:29Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/3057

lamb-j · 2025-12-03T15:32:34Z

Even though TheRock is still in preview, we can/should put this upstream.

This patch is a follow up to llvm#170471.

So that we can actually test the workflow before comitting into tree.

This reverts commit 671a8ce. This stil broke the clangd-ubuntu-tsan bot. It seems like somehow the PATH variable is not getting propagated in the system-include-extractor.test test.

Currently Apple unrolling preferences are not applied to apple-a17.

This does not test most of the functionality (i.e., that writing to an existing comment still works), but does ensure that the plumbing works and things are not completely broken. Reviewers: tstellar, cmtice Reviewed By: cmtice Pull Request: llvm#170209

…_epi64 (llvm#170442) This PR adds constexpr support for the AVX2 cross-lane permute intrinsics _mm256_permute4x64_pd and _mm256_permute4x64_epi64 Resolves llvm#169304

This is designed to allow a workflow (e.g., premerge) upload comments across multiple jobs. Subsequent PRs will wire this up within the issue-write workflow to support reading comments from multiple files. Reviewers: tstellar, cmtice Reviewed By: cmtice Pull Request: llvm#170216

The logic for narrowing to single scalar recipes is in two different places: narrowToSingleScalarRecipes and legalizeAndOptimizeInductions. Consolidate them.

We should actually run the test workflow when touching the workflow we are attempting to test.

It appears that using it will result in callers mistakenly thinking that complex f128 returns is done the sret-way, when it should be returned in registers.

This changed the name of one of the outputs that issue-write used to control whether or not it ran. This patch should fix that.

The libcall lowering decisions should be program dependent, depending on the current module's RuntimeLibcallInfo. We need another related analysis derived from that plus the current function's subtarget to provide concrete lowering decisions. This takes on a somewhat unusual form. It's a Module analysis, with a lookup keyed on the subtarget. This is a separate module analysis from RuntimeLibraryAnalysis to avoid that depending on codegen. It's not a function pass to avoid depending on any particular function, to avoid repeated subtarget map lookups in most of the use passes, and to avoid any recomputation in the common case of one subtarget (and keeps it reusable across repeated compilations). This also switches ExpandFp and PreISelIntrinsicLowering as a sample function and module pass. Note this is not yet wired up to SelectionDAG, which is still using the LibcallLoweringInfo constructed inside of TargetLowering.

This reverts commit 5e5937c, since the added test fails on the `lldb-x86_64-win` buildbot. https://lab.llvm.org/buildbot/#/builders/211/builds/4246

This is so that we can read from multiple files emitted by the premerge workflow. Reviewers: tstellar, cmtice Reviewed By: cmtice Pull Request: llvm#170411

…vm#170388) Generate one global per static shared variable so the alignment can be set separately. Dynamic shared memory is unchanged.

Adds more test coverage for llvm#170223.

…on. (llvm#170474) Follow-up to llvm#169576 to enable UDiv canonicalization if the start of the AddRec is not constant. The fold is not restricted to constant start values, as long as we are able to compute a constant remainder. The fold is only applied if the subtraction of the remainder can be folded into to start expression, but that is just to avoid creating more complex AddRecs. For reference, the proof from llvm#169576 is https://alive2.llvm.org/ce/z/iu2tav PR: llvm#170474

This PR adds unrolling for vector.create_mask op based on the targetShape. Each unrolled vector computes its local mask size in each dimension (d) as: min(max(originalMaskSize[d] - offset[d], 0), unrolledMaskSize[d]).

…0420) Disable test due to CASTests failing nondeterministically on some windows CI.

Only some fortran source files in flang/test/Lower/OpenMP have been modified. More cleanups will be performed in subsequent commits.

…170414) Now that the issue-write workflow can support writing comments from multiple files, make the premerge workflow write out comments from both x86_64 Linux and Windows. AArch64 Linux right now is left out as the premerge advisor does not currently support it.

…#169559) Introduce Hexagon-specific passes to generate widening vector instructions for integer and floating-point operations using generic LLVM intrinsics. This enables widening operations for short vectors and improves type legalization by allowing operands to be widened to appropriate types. The patch also includes a shuffle optimization pass to relocate and validate shufflevector instructions during widening legalization. Co-authored-by: Jyotsna Verma <jverma@qti.qualcomm.com> Co-authored-by: Yashas Andaluri <yandalur@qti.qualcomm.com> Co-authored-by: Fateme Hosseini <fhossein@qti.qualcomm.com> Co-authored-by: Muntasir Mallick <mallick@qti.qualcomm.com> Co-authored-by: Tatiana Larina <larina@qti.qualcomm.com> Co-authored-by: Kaushik Kulkarni <kauskulk@qti.qualcomm.com> Co-authored-by: Jyotsna Verma <jverma@qti.qualcomm.com> Co-authored-by: Yashas Andaluri <yandalur@qti.qualcomm.com> Co-authored-by: Muntasir Mallick <mallick@qti.qualcomm.com> Co-authored-by: Tatiana Larina <larina@qti.qualcomm.com> Co-authored-by: Kaushik Kulkarni <kauskulk@qti.qualcomm.com>

This is an NFC patch that aims to simplify how the scanner calls `Consumer.handleBuildCommand()` by doing it directly in `DependencyScanningAction` instead of going through the `setLastCC1Arguments()` and `takeLastCC1Arguments()` dance with the client.

Shared memory needs to be aligned like in the bulk load operation.

…is incomplete (llvm#163654) Indexing in .debug_str section could lead to integer overflow when in DWARF32 format. https://github.com/llvm/llvm-project/blob/e61e6251b692ffe71910bad22b82e41313f003cf/llvm/lib/DWP/DWP.cpp#L35C30-L35C47 This can lead to missing symbols from the DWARF info, and hurts profile quality. As a workaround, we may use information from the symbol table (.symtab), and recover the missing symbols with addresses and ranges. Output: ``` # Before ... warning: 6.64%(338916009/5106344252) of samples are from ranges that do not belong to any functions. # After ... warning: 0.07%(3501587/4906133148) of samples are from ranges that do not belong to any functions. warning: 5.71%(280266919/4906133148) of samples are from ranges that belong to functions recovered from symbol table. ``` We see 0.4% - 1.35% performance improvements on our internal services where profiles are generated with binaries that have .debug_str section over 4GB.

llvm#170564)

This commit adds two Clang builtins for PowerPC AMO load operations: __builtin_amo_lwat for 32-bit unsigned operations __builtin_amo_ldat for 64-bit unsigned operations Also adds an amo.h header that maps GCC's AMO functions to these Clang builtins for compatibility.

Similar to llvm#161881, we will need private/firstprivate/reduction representation for acc kernels for automatic privatization

This PR adds hardware-measured latencies/Occupancy for all RVV load/stores to the SpacemiT-X60 scheduling model.

…#167254) When emitting the assembly .set directive, KCFI needs to use getZExtValue(). However, this means that FileCheck pattern matching can't match between the .set directive and the IR when the high bit of a 32-bit value is set. We had gotten lucky with the existing tests happening to just not have had the high bit set. The coming hash change will expose this, though. LLVM IR's default printing behavior uses APInt::operator<<, which calls APInt::print(OS, /*isSigned=*/true). This means KCFI operand bundles in call instructions print as signed (e.g. [ "kcfi"(i32 -1208803271) ]), and KCFI type metadata prints as signed (e.g. !3 = !{i32 -1208803271}). Changing the IR to print unsigned i32 values would impact hundreds of existing tests, so it is best to just leave it be. Update the KCFI .set direct to use getSExtValue() in a comment so that we can both build correctly and use FileCheck with pattern matching in tests. KCFI generates hashes in two places. Instead of exposing the hash implementation in both places, introduce a helper that wraps the specific hash implementation in a single place, llvm::getKCFITypeID. In order to transition between KCFI hash, we need to be able to specify them. Add the Clang option -fsanitize-kcfi-hash= and a IR module option "kcfi-hash" that can choose between xxHash64 and FNV-1a. Default to xxHash64 to stay backward compatible, as we'll need to also update rustc to take a new option to change the hash to FNV-1a for interop with the coming GCC KCFI.

Fixes: llvm#170552

a4lg · 2025-12-03T23:19:03Z

@lamb-j
Thanks for letting me know. I'll submit the same PR to upstream (and keep this PR open for a while).

On Mac `unsigned long long` corresponds to `uint64_t` so `%llu` is needed. Simply always using `%llu` and upcasting to `unsigned long long` should make it work fine.

…ndencyScanningTool (NFC) (llvm#169962) This patch is the first of two in refactoring Clang's dependency scanning tooling to remove its dependency on clangDriver. It separates Tooling/DependencyScanningTool.cpp from the rest of clangDependencyScanning and moves clangDependencyScanning out of clangTooling into its own library. No functional changes are introduced. The follow-up patch (llvm#169964) will restrict clangDependencyScanning to handling only -cc1 command line inputs and will move all functionality related to handling driver commands into clangTooling. (Tooling/DependencyScanningTool.cpp). This is part of a broader effort to support driver-managed builds for compilations using C++ named modules and/or Clang modules. It is required for linking the dependency scanning tooling against the driver without introducing cyclic dependencies, which would otherwise cause build failures when dynamic linking is enabled. The RFC for this change can be found here: https://discourse.llvm.org/t/rfc-new-clangoptions-library-remove-dependency-on-clangdriver-from-clangfrontend-and-flangfrontend/88773?u=naveen-seth

z1-cciauto · 2025-12-04T00:12:56Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/3075

TheRock has slightly different path (relative to ROCM_PATH) for device libraries. This commit adds search path for device libraries on a TheRock-based distribution. Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>

z1-cciauto · 2025-12-04T00:59:41Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/3076

charles-zablit and others added 27 commits December 3, 2025 20:14

[lldb] ensure comment conforms to LLVM guidelines (llvm#170533)

fc1e911

This patch is a follow up to llvm#170471.

[Github] Add user of issue-write for llvm#170209

c656bf3

So that we can actually test the workflow before comitting into tree.

Revert "[clangd] Enable lit internal shell by default (llvm#170186)"

43b6916

This reverts commit 671a8ce. This stil broke the clangd-ubuntu-tsan bot. It seems like somehow the PATH variable is not getting propagated in the system-include-extractor.test test.

[AArch64] Add unrolling test with -mcpu=apple-a17.

c5e6f4e

Currently Apple unrolling preferences are not applied to apple-a17.

[Clang][X86] Add constexpr support for permute4x64_pd and permute4x64…

8f6e95e

…_epi64 (llvm#170442) This PR adds constexpr support for the AVX2 cross-lane permute intrinsics _mm256_permute4x64_pd and _mm256_permute4x64_epi64 Resolves llvm#169304

[VPlan] Consolidate logic for narrowToSingleScalars (NFCI) (llvm#167360)

7b3ec51

The logic for narrowing to single scalar recipes is in two different places: narrowToSingleScalarRecipes and legalizeAndOptimizeInductions. Consolidate them.

[Github] Also run test issue write when touching issue-write.yml

58dd3a4

We should actually run the test workflow when touching the workflow we are attempting to test.

[SPARC] Remove CCIfConsecutiveRegs for f128 returns (llvm#170133)

3d598c3

It appears that using it will result in callers mistakenly thinking that complex f128 returns is done the sret-way, when it should be returned in registers.

[Github] Fix issue-write workflow after llvm#170216

5cbd294

This changed the name of one of the outputs that issue-write used to control whether or not it ran. This patch should fix that.

Revert "[LLDB] Add SBFrameExtensions Tests (llvm#169236)" (llvm#170555)

b13b41a

This reverts commit 5e5937c, since the added test fails on the `lldb-x86_64-win` buildbot. https://lab.llvm.org/buildbot/#/builders/211/builds/4246

[Github] Make issue-write workflow support reading from multiple files

d2accd3

This is so that we can read from multiple files emitted by the premerge workflow. Reviewers: tstellar, cmtice Reviewed By: cmtice Pull Request: llvm#170411

[flang][cuda] Change how to handle static shared memory variables (ll…

2fb2d7e

…vm#170388) Generate one global per static shared variable so the alignment can be set separately. Dynamic shared memory is unchanged.

[LV] Add more tests for finding the first-iv of argmin.

095f8e0

Adds more test coverage for llvm#170223.

DAG: Use poison for filler values on legalize error paths (llvm#170556)

90606ae

[CAS] Disable CASTests on Windows versions before windows 11 (llvm#17…

1f6d0d8

…0420) Disable test due to CASTests failing nondeterministically on some windows CI.

[flang][NFC] Strip trailing whitespace from tests (9 of N)

e5675a4

Only some fortran source files in flang/test/Lower/OpenMP have been modified. More cleanups will be performed in subsequent commits.

[flang][cuda] Set alignment for tma bulk store (llvm#170558)

848d865

Shared memory needs to be aligned like in the bulk load operation.

[gn build] Port 4da31b6

278d28f

clementval and others added 7 commits December 3, 2025 22:38

[flang][cuda] Do not use address cast for src and dst in TMA bulk load (

e52aece

llvm#170564)

[acc] Add firstprivate/private/reduction to acc.kernels (llvm#170387)

254b33f

Similar to llvm#161881, we will need private/firstprivate/reduction representation for acc kernels for automatic privatization

[RISCV] Update SpacemiT-X60 vector load/stores (llvm#169936)

3b9a0db

This PR adds hardware-measured latencies/Occupancy for all RVV load/stores to the SpacemiT-X60 scheduling model.

[gn build] Port f650330

4bf7c59

[AMDGPU] Fix VGPR lowering for V_DUAL_FMAMK_F32 (llvm#170567)

31ec45a

Fixes: llvm#170552

llvmgnsyncbot and others added 3 commits December 3, 2025 23:21

[gn build] Port 6e47966

d33d7ce

[lldb-dap] Fix format string on Mac OS (llvm#169933)

21f98f9

On Mac `unsigned long long` corresponds to `uint64_t` so `%llu` is needed. Simply always using `%llu` and upcasting to `unsigned long long` should make it work fine.

a4lg force-pushed the amdgpu-therock-device-lib-path branch from 5ed03dd to d3314f0 Compare December 4, 2025 00:10

a4lg requested review from kuhar and nicolasvasilache as code owners December 4, 2025 00:10

[Clang][AMDGPU] Fix a typo

bb52fec

a4lg force-pushed the amdgpu-therock-device-lib-path branch from d3314f0 to aee3336 Compare December 4, 2025 00:11

[Clang][AMDGPU] Search TheRock-based device libraries

b426723

TheRock has slightly different path (relative to ROCM_PATH) for device libraries. This commit adds search path for device libraries on a TheRock-based distribution. Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>

a4lg force-pushed the amdgpu-therock-device-lib-path branch from aee3336 to b426723 Compare December 4, 2025 00:58

a4lg mentioned this pull request Dec 4, 2025

[Clang][AMDGPU] Search TheRock-based device libraries llvm/llvm-project#170590

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AMDGPU] Search TheRock-based device libraries #739

[AMDGPU] Search TheRock-based device libraries #739

Uh oh!

a4lg commented Dec 3, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Dec 3, 2025

Uh oh!

z1-cciauto commented Dec 3, 2025

Uh oh!

lamb-j commented Dec 3, 2025

Uh oh!

a4lg commented Dec 3, 2025

Uh oh!

z1-cciauto commented Dec 4, 2025

Uh oh!

z1-cciauto commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

[AMDGPU] Search TheRock-based device libraries #739

Are you sure you want to change the base?

[AMDGPU] Search TheRock-based device libraries #739

Uh oh!

Conversation

a4lg commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background / Reason of this PR

Workaround (without this PR)

Why PR to ROCm repository and not upstream?

Uh oh!

github-actions bot commented Dec 3, 2025

Uh oh!

z1-cciauto commented Dec 3, 2025

Uh oh!

lamb-j commented Dec 3, 2025

Uh oh!

a4lg commented Dec 3, 2025

Uh oh!

z1-cciauto commented Dec 4, 2025

Uh oh!

z1-cciauto commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

a4lg commented Dec 3, 2025 •

edited

Loading