
Conversation

@DiamonDinoia
Contributor

Dear @serge-sans-paille,

I made a rough implementation of masked load/store. Before I flesh it out, can I have some early feedback?

Thanks,
Marco

@serge-sans-paille
Contributor

Interesting. Before going into the details, what are the memory effects of a masked load in terms of memory read and values stored? Are the masked elements set to 0? To an undefined value? AVX etc. seems to set the value to zero.

I'm not sold on the common implementation, which looks quite heavy on scalar operations. I can see that we can't do a plain load followed by an and, because it could lead to accesses to unallocated memory. If the mask were constant, we could statically optimize some common patterns, but with a dynamic mask as you propose...

@DiamonDinoia
Contributor Author

DiamonDinoia commented Aug 22, 2025

Some thoughts:

  1. Undefined for masked values, since depending on the operation 0 or 1 might be the correct fill value; in that case the user could use the mask itself to initialize them. Also, imagine I want to populate the even elements from one memory location and the odd ones from another; masked loads are (I think) faster than a gather. Set to 0, following the x86 convention.
  2. We could remove the dynamic mask entirely. I added it for completeness.
  3. We could do it à la VCL and have a load partial / store partial where we only optimize for the head and tail. I preferred this solution, as I'm assuming xsimd users know the performance implications of the API.
  4. For now this is fast only on AVX and AVX2; for SSE, even though it is heavy on scalar code, it is slow only when reading bytes or shorts. We could optimize these cases.
  5. I'm not sure about SVE/NEON.
  6. With static masks, if both the first and the last element are read, it is possible to do a load followed by an and.

In general, I use these operations when I want to vectorize an inner loop whose length is not a multiple of the SIMD width: a small inner loop nested in a loop that is executed many times. Depending on the operations, padding is sometimes slower than masking.
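To make the use case concrete, here is a minimal sketch of that pattern (not the PR code; it assumes SSE4.2 batches and a fixed row width of 6 floats). The scalar remainder loop is exactly the part a compile-time masked load/store would replace:

#include <xsimd/xsimd.hpp>
#include <cstddef>

void scale_rows(float* data, std::size_t rows, float factor)
{
    constexpr std::size_t N = 6;                       // inner extent, not a multiple of 4
    using batch = xsimd::batch<float, xsimd::sse4_2>;  // 4 float lanes
    for (std::size_t r = 0; r < rows; ++r)
    {
        float* row = data + r * N;
        auto v = batch::load_unaligned(row) * factor;  // lanes 0..3
        v.store_unaligned(row);
        // remainder (elements 4..5): this scalar tail is what the proposed
        // masked load/store with a compile-time mask would cover
        for (std::size_t i = batch::size; i < N; ++i)
            row[i] *= factor;
    }
}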

@DiamonDinoia
Contributor Author

Hi @serge-sans-paille,

This is starting to take shape now. I still need to clean up, but this is a first working implementation.

I extended the mask operations in batch_bool to mimic std::bit so that I could reuse the code for all masked operations. I feel that having those live outside the class in detail:: is not ideal; they are also useful to the user, I think.
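For context, here is a small self-contained sketch of the <bit>-style semantics in question (not the xsimd code itself; it assumes lane 0 plays the role of the least-significant bit, and uses C++14 constexpr for brevity):

#include <cstddef>

// Counting trailing/leading disabled lanes of a compile-time mask, mirroring
// std::countr_zero / std::countl_zero from <bit>.
template <std::size_t N>
constexpr std::size_t countr_zero(const bool (&mask)[N])
{
    std::size_t n = 0;
    while (n < N && !mask[n])          // disabled lanes from lane 0 upwards
        ++n;
    return n;
}

template <std::size_t N>
constexpr std::size_t countl_zero(const bool (&mask)[N])
{
    std::size_t n = 0;
    while (n < N && !mask[N - 1 - n])  // disabled lanes from the top lane downwards
        ++n;
    return n;
}

// Example: for {true, true, true, false}, countr_zero == 0 (a prefix mask) and
// countl_zero == 1 (the top lane is unused), which is the information the masked
// kernels later in this thread branch on.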

@DiamonDinoia force-pushed the masked-memory-ops branch 8 times, most recently from fd0e777 to ecf4291 (October 9, 2025, 23:27)
@DiamonDinoia
Contributor Author

@serge-sans-paille do you have any clue why mingw32 fails?

I need to do a final pass to make the code coherent (I am mixing reinterpret_casts and C-style casts). After that, I think it will be ready for review.

}
else
{
// Fallback to the runtime path for non-prefix/suffix masks
Contributor

Unless the compiler optimizes those out, the dynamic load is going to perform several checks that will always be false (none / all).
I'm still not convinced by the relevance of dynamic mask support. Maybe we could start small with "just" static mask support and add a dynamic version later if it turns out to be needed? But maybe you have a specific case in mind?

Contributor Author

It is just that, in my mind, I can offer a public runtime-mask API; the compile-time mask then optimizes as much as it can and, if there is no alternative, falls back to the runtime mask.

Same as the swizzle. So I was not planning to optimize the runtime path at all: we just get a masked load/store where possible, or a common (generic) fallback, since SSE does not support masking.

PS:
This can be optimized in the future by using AVX on SSE or AVX-512VL, but I will open a discussion once this is merged.

Contributor Author

I am cleaning this up to only offer compile-time masks for now; the runtime version can be a separate PR.

@DiamonDinoia force-pushed the masked-memory-ops branch 11 times, most recently from e535375 to 4c457ae (October 19, 2025, 02:18)
@DiamonDinoia marked this pull request as ready for review on October 19, 2025, 04:02
@DiamonDinoia
Contributor Author

I don't think the error is related to anything in this PR. But I like the looks of it now. Ready for review!

{
constexpr std::size_t size = batch<T_out, A>::size;
alignas(A::alignment()) std::array<T_out, size> buffer {};
constexpr std::array<bool, size> mask { Values... };
Contributor

It seems to me that operator[] is not constexpr for std::array in C++11. Let's use a plain C array then?

Contributor Author

Yes, I'll replace it. It is only constexpr from C++14/17.
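For reference, a C++11-friendly variant of the quoted snippet could look like this (a sketch reusing the surrounding template parameters T_out, A and Values; indexing a constexpr plain C array is already a constant expression in C++11):

constexpr std::size_t size = batch<T_out, A>::size;
alignas(A::alignment()) std::array<T_out, size> buffer {}; // the runtime buffer can stay a std::array
constexpr bool mask[size] = { Values... };                 // plain C array: mask[i] with a constant
                                                           // index works even in C++11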

return _mm256_insertf128_pd(_mm256_castpd128_pd256(low), high, 1);
}
template <class T>
XSIMD_INLINE batch<T, sse4_2> lower_half(batch<T, avx> const& self) noexcept
Contributor

This seems recursive. Don't you want to use self.data?
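For illustration, a non-recursive version of the double case could look like the following (a sketch; it assumes, as the comment suggests, that the raw register is reachable as self.data):

XSIMD_INLINE batch<double, sse4_2> lower_half(batch<double, avx> const& self) noexcept
{
    return _mm256_castpd256_pd128(self.data); // take the low 128-bit lane directly, no recursion
}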


// Convenience helpers for half splits (require even size and appropriate target arch)
template <class A2>
static constexpr auto lower_half() noexcept
Contributor

I'm not a big fan of this; batch_constant sizes are tied to an architecture...

Contributor

Maybe we could put some of the functions below as free functions somewhere?

Contributor Author

I am not against moving splice outside the class. I implemented it here because it avoids passing the current architecture. It could go somewhere in xsimd::common, though maybe not common memory, as swizzle or other APIs might use this in the future.

Contributor

It could stay in the same file, just as free functions?

@serge-sans-paille
Contributor

@DiamonDinoia we plan to do a release in the forthcoming weeks. Do you want this PR to be part of it?

@DiamonDinoia
Contributor Author

@DiamonDinoia we plan to do a release in the forthcoming weeks. Do you want this PR to be part of it?

Only if it does not delay the release by much. I would like to have a release by mid-November; that gives me two weeks to update my downstream projects for an end-of-year release.

PS: I am planning on addressing the swizzle issue after this PR.

@JohanMabille
Member

I would like to have a release by mid-November

That is more or less what we planned ;)

Contributor
@serge-sans-paille left a comment

A new wave of comments. I have the feeling that we could simplify the implementation a lot by making it more generic.

#include <limits>
#include <type_traits>

#include "../../types/xsimd_batch_constant.hpp"
Contributor

Why do you need that change?

Contributor Author

I removed some includes from xsimd_common_fwd and replaced them with forward declarations, so now the concrete types need to be included in some of the other headers.

This speeds up compilation a bit, and I thought that a fwd header should really contain only forward declarations.

return load<A>(mem, Mode {});
}
// confined to lower 128-bit half (4 lanes) → forward to SSE
else XSIMD_IF_CONSTEXPR(mask.countl_zero() >= 4)
Contributor

Thinking of it, we could remove the need for countl_zero by just masking, something like

if((mask.mask() & 0xF0) == 0)

which would be a decent implementation for

if(mask.lower_half() == 0) // or lower_mask ?

which has the nice property of not encoding any "magical value"

Contributor Author
@DiamonDinoia Nov 5, 2025

The reason I went for countl_zero is to maintain parity with std::bit. I feel this is public API, so it is worth users not having to learn a new naming scheme.

else XSIMD_IF_CONSTEXPR(mask.countr_zero() >= 2)
{
constexpr auto mhi = ::xsimd::detail::upper_half<sse2>(mask);
const batch<double, sse2> hi(_mm256_extractf128_pd(src, 1));
Contributor

that's an upper_half, right?

else XSIMD_IF_CONSTEXPR(mask.countl_zero() >= 2)
{
constexpr auto mlo = ::xsimd::detail::lower_half<sse2>(mask);
const batch<double, sse2> lo(_mm256_castpd256_pd128(src));
Contributor

that's a lower_half, right?

{
_mm256_maskstore_pd(mem, mask.as_batch(), src);
}
}
Contributor

Epiphany: if you express everything in terms of lower/upper half, you're pretty close to a type-generic implementation that would avoid a lot of redundancy. We just need to wrap the actual _mm256_maskstore* and we're good, right?

Contributor Author

I guess you are right. It might be possible to express things in terms of sizes and halves. This might be recursive, though, and we might need to map an arch to its half arch, similarly to what happens in widen.

Contributor Author

I'll also have to think about whether the size of the element matters. I'll have a look next week.

Contributor Author

The reason why I kept the implementations separate is as follows:

A general implementation in terms of upper/lower halves is possible. It halves the code in AVX, but AVX2 still has to provide its own implementation so that it can take advantage of the newer intrinsics, so we go from 2 versions in AVX down to 1 and from 2 in AVX2 down to 1. Same for AVX-512.

But then, for merging, we cannot use intrinsics more optimized than _mm256_castps128_ps256. Is it worth it?

The alternative is to pass the MaskLoad/Store and the merge function as template parameters to one general implementation. Then we could maybe have a single general implementation for all the kernels, but I think it might be too much templating; a rough sketch of that option follows below.

In general, wrapping _mm256_maskstore* also gives a clean way to expose this to the user if we wish to do so.

What do you suggest?
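A rough, self-contained sketch of the templated option (all names hypothetical, not the PR's code): the splitting decision lives in one generic helper, and each ISA only supplies its native masked load and its merge step as callables.

#include <immintrin.h>

template <bool UpperHalfMaskedOff, class NativeLoad, class HalfLoad, class Merge>
inline __m256 load_masked_f32(float const* mem, NativeLoad&& native_load,
                              HalfLoad&& half_load, Merge&& merge) noexcept
{
    if (UpperHalfMaskedOff)                  // compile-time constant, folded away
    {
        const __m128 lo = half_load(mem);    // 128-bit path over the live lanes only
        return merge(lo, _mm_setzero_ps());  // zero out the dead upper half
    }
    // full-width masked load, e.g. a wrapper around _mm256_maskload_ps
    return native_load(mem);
}

Whether the extra template parameters pay for themselves is exactly the trade-off above: this avoids duplicating the splitting logic per ISA, at the cost of more templating.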

return _mm512_loadu_pd(mem);
}

// load_masked
Contributor

Why did you split load_masked into two parts of the header? The generic epiphany for AVX seems to apply here too :-)

{
constexpr auto mlo = ::xsimd::detail::lower_half<sse4_2>(mask);
const auto lo = load_masked(mem, mlo, convert<float> {}, Mode {}, sse4_2 {});
return batch<float, A>(detail::merge_sse(lo, batch<float, sse4_2>(0.f)));
Contributor
@serge-sans-paille Oct 31, 2025

This could be a call to _mm256_castps128_ps256 instead of merge + a batch of zeros.

{
constexpr auto mhi = ::xsimd::detail::upper_half<sse4_2>(mask);
const auto hi = load_masked(mem + 4, mhi, convert<float> {}, Mode {}, sse4_2 {});
return batch<float, A>(detail::merge_sse(batch<float, sse4_2>(0.f), hi));
Contributor
@serge-sans-paille Oct 31, 2025

I don't know if we could use a trick similar to _mm256_castps128_ps256 here.

@serge-sans-paille
Contributor

serge-sans-paille commented Nov 3, 2025

@AntoinePrv once you update that branch we can do a release \o/
EDIT: I meant @DiamonDinoia of course :-)

@DiamonDinoia
Contributor Author

@AntoinePrv once you update that branch we can do a release \o/

I will work on this ASAP. I just have something to deliver on another project this week, so it might take a bit for me to get back to this.

@serge-sans-paille
Contributor

Note that once you're done with Intel support, I still have to implement ARM and the like where possible.

1. Adds a new masked API with compile-time masks (store_masked and load_masked); a hypothetical usage sketch follows after this list
2. General use-case optimization
3. New tests
4. x86 kernels
5. Adds new convenience APIs to batch_bool_constant, resembling the <bit> header
6. Tests for the new APIs
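A hypothetical usage sketch of the API summarized above (the exact public spelling, i.e. free function vs. member, argument order, and the mask-type spelling, is an assumption based on the excerpts in this thread, not necessarily what lands in the PR):

#include <xsimd/xsimd.hpp>

// Hypothetical: load the first 3 of 4 double lanes through a compile-time mask,
// scale them, and store them back through the same mask. The load_masked /
// store_masked names come from the summary above; their exact signatures are assumed.
void scale_first_three(double* p, double factor)
{
    constexpr xsimd::batch_bool_constant<double, xsimd::avx,
                                         true, true, true, false> mask {};
    auto v = xsimd::load_masked(p, mask);   // masked-off lane comes back as 0 (x86 convention)
    v = v * factor;
    xsimd::store_masked(p, v, mask);        // only the three live lanes are written
}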