Skip to content

Conversation

@joaosaffran
Copy link
Collaborator

@joaosaffran joaosaffran commented Nov 20, 2025

This patch adds support to WaveMatch in the HLK Long Vector Tests.
Tested with Wave sizes: 4, 8, 16, 32, 64, 128.

Closes: #7613

@alsepkow
Copy link
Contributor

Were you able to validate these? I would expect these to fail on WARP right now.

@github-project-automation github-project-automation bot moved this from New to In progress in HLSL Roadmap Nov 20, 2025
@joaosaffran joaosaffran marked this pull request as draft November 21, 2025 18:48
@joaosaffran joaosaffran marked this pull request as ready for review November 21, 2025 23:59
@joaosaffran joaosaffran requested a review from damyanp November 21, 2025 23:59
}
uint4 result = WaveMatch(Vector);
uint index = WaveGetLaneIndex() * 4;
if(index < NUM)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the intent of using NUM was here.
NUM describes the number of elements in the input Vector, which isn't related to the number of lanes in the wave.
I think what you will want to do is store the mask from each lane. That is, your output should have WaveGetLaneCount() elements. With each element being a uint4. And in this test case you would expect to have two unique masks. With the first mask being for the first group in lane 0 only. And all other lanes in another group (mask).

That would greatly simplify your logic for computing expected values as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't need NUM check here. Thanks for pointing out.

The output is storing the result mask for each lane, the mask is 128 bits long, so I am breaking into groups of 32 bits. This is done, so I can inspect all the bits and make sure the correct ones are being set.

If I store it as a uint4, like here:

g_OutputVector.Store<uint>(WaveGetLaneIndex() * sizeof(uint4), result);

Some bits get truncated to 0, which is not the correct value, since all but one lane will have a different value, the number that I expect here changes if I change the lane that has the different vector. Instead, to make sure the values are correct, I need to store uint each component of WaveMatch resulting mask, like so:

uint index = WaveGetLaneIndex() * 4;

g_OutputVector.Store<uint>(index * sizeof(uint), result.x);
g_OutputVector.Store<uint>((index + 1) * sizeof(uint), result.y);
g_OutputVector.Store<uint>((index + 2) * sizeof(uint), result.z);
g_OutputVector.Store<uint>((index + 3) * sizeof(uint), result.w);

Copy link

@inbelic inbelic Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a typo in your response, but shouldn't it be .Store<uint4>? I wouldn't expect that to truncate anything

@joaosaffran joaosaffran requested a review from alsepkow November 24, 2025 19:34
@inbelic
Copy link

inbelic commented Nov 24, 2025

From pair review with Alex, we were concerned that the WaveMatch spec does not specifically detail how bits for unused lanes are handled, so we should probably assume that it is undefined behaviour. For the sake of testing then we shouldn't check that those bits match being set to 0.

So we might want/need to have the wave size always be 32 and only check the first component. However, we weren't sure if that was easily possible of the top of our heads.

Do drivers ignore those bits or set them to zero?

@damyanp
Copy link
Member

damyanp commented Nov 24, 2025

...we were concerned that the WaveMatch spec does not specifically detail how bits for unused lanes are handled.

It looks to me that the spec does specify this (they're set to 0):

"Bits in the mask corresponding to inactive lanes, or at positions beyond current implementation’s wave width, will contribute 0’s."

So we might want/need to have the wave size always be 32

This isn't possible, not all devices support wave size == 32. In D3D we need to support 4, 8, 16, 32, 64 and 128 wave sizes. (Vulkan also supports 1 and 2).

@inbelic
Copy link

inbelic commented Nov 24, 2025

"Bits in the mask corresponding to inactive lanes, or at positions beyond current implementation’s wave width, will contribute 0’s."

Ah skipped over that sentence... thanks. The last sentence tripped me up as I assumed it was only defined for pixel shaders.

We can disregard needing 32 lanes then as we can match with the expected output = 0.

OP(Wave, WaveMultiPrefixBitAnd, 1, "TestWaveMultiPrefixBitAnd", "", " -DFUNC_WAVE_MULTI_PREFIX_BIT_AND=1 -DIS_WAVE_PREFIX_OP=1", "LongVectorOp", WaveMultiPrefixBitwise, Default2, Default3)
OP(Wave, WaveMultiPrefixBitOr, 1, "TestWaveMultiPrefixBitOr", "", " -DFUNC_WAVE_MULTI_PREFIX_BIT_OR=1 -DIS_WAVE_PREFIX_OP=1", "LongVectorOp", WaveMultiPrefixBitwise, Default2, Default3)
OP(Wave, WaveMultiPrefixBitXor, 1, "TestWaveMultiPrefixBitXor", "", " -DFUNC_WAVE_MULTI_PREFIX_BIT_XOR=1 -DIS_WAVE_PREFIX_OP=1", "LongVectorOp", WaveMultiPrefixBitwise, Default2, Default3)
OP_DEFAULT_DEFINES(Wave, WaveMatch, 1, "TestWaveMatch", "", " -DFUNC_WAVE_MATCH=1 -DIS_WAVE_PREFIX_OP=1")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-DIS_WAVE_PREFIX_OP=1 is required, the test function returns void, since it handles writing to the out vector inside the test function, instead of delegating to main

@joaosaffran joaosaffran requested a review from damyanp November 25, 2025 02:14
@joaosaffran joaosaffran requested a review from damyanp November 25, 2025 17:22
uint64_t LowWaveShift = (LowWaves < 64) ? (1ULL << LowWaves) : 0;
uint64_t HighWaveShift = (HighWaves < 64) ? (1ULL << HighWaves) : 0;

uint64_t result[2] = {(LowWaveShift - 1 & ~1ULL),
Copy link
Member

@damyanp damyanp Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to follow LLVM naming conventions.

Suggested change
uint64_t result[2] = {(LowWaveShift - 1 & ~1ULL),
uint64_t Result[2] = {(LowWaveShift - 1 & ~1ULL),

(note that this is the reason I'm requesting changes - but I really think that this code needs to go away!)

Copy link
Member

@damyanp damyanp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to follow LLVM naming conventions.

I also have feedback on how some of the calculations are structured that I really think you should take.

Comment on lines 1561 to 1565
uint64_t LowWaveShift = (LowWaves < 64) ? (1ULL << LowWaves) : 0;
uint64_t HighWaveShift = (HighWaves < 64) ? (1ULL << HighWaves) : 0;

uint64_t result[2] = {(LowWaveShift - 1 & ~1ULL),
(HighWaveShift - 1 & ~0ULL)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to nudge you in (what I think is) the right direction here. It really bothers me that the "build the mask" and the "what we expect the result to be" are mangled into the same expression. This is my recommended approach that separates the concerns:

Suggested change
uint64_t LowWaveShift = (LowWaves < 64) ? (1ULL << LowWaves) : 0;
uint64_t HighWaveShift = (HighWaves < 64) ? (1ULL << HighWaves) : 0;
uint64_t result[2] = {(LowWaveShift - 1 & ~1ULL),
(HighWaveShift - 1 & ~0ULL)};
uint64_t LowWaveMask = ((LowWaves < 64) ? (1ULL << LowWaves) : 0) - 1;
uint64_t HighWaveMask = ((HighWaves < 64) ? (1ULL << HighWaves) : 0) - 1;
uint64_t LowExpected = ~1ULL & LowWaveMask;
uint64_t HighExpected = ~0ULL & HighWaveMask;

IMO, separating it like this makes it clearer what the different calculations are.

Although this is take-it-or-leave-it, I do really think you should factor it this way!

@joaosaffran joaosaffran requested a review from damyanp November 25, 2025 19:46
@joaosaffran joaosaffran marked this pull request as draft November 25, 2025 20:00
@joaosaffran joaosaffran marked this pull request as ready for review November 26, 2025 02:36
@joaosaffran joaosaffran marked this pull request as draft November 26, 2025 18:16
@joaosaffran joaosaffran marked this pull request as ready for review November 26, 2025 23:19
@joaosaffran joaosaffran merged commit 78c43dd into microsoft:main Nov 27, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in HLSL Roadmap Nov 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Long Vector Execution Tests: WaveMatch

5 participants