This fixes an invalid memory access caused by the _mm256_maskload_ps
intrinsic used in the remainder-handling logic introduced in
microsoft#23694.
The core of the problem is that _mm256_maskload_ps (and its store
equivalent) can read beyond the masked elements.
Even if the mask correctly specifies that you only want to load, for
example, 3 floats, the intrinsic may still read the full 32 bytes (8
floats) from the provided memory address.
The invalid access occurs when one of the buffers (input, sin_data, or
cos_data) ends near the boundary of a memory page, and the part of the
32-byte read that you don't care about (i.e., the masked-off part) falls
onto an unmapped page. This will cause a segmentation fault (invalid
access).
The Solution: Use a Scalar Remainder Loop
The simplest, safest, and most robust solution is to replace the masked
AVX remainder logic with a simple scalar loop. This is the exact
strategy already used by the RopeKernel_Avx2_fp16_Impl functions, which
are safe from this bug.
The performance impact of this change will be negligible, as this loop
only processes the final 1-15 elements.
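
As a rough illustration of the idea, the sketch below finishes a rotary embedding one element at a time, so it never reads past the last valid float. It is not the actual kernel code: the helper name RopeRemainderScalar, its parameter list, and the non-interleaved layout (second half of the row starting at input + half_dim) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not the real ONNX Runtime function.
// Handles indices [start, half_dim) that the vectorized main loop left over.
// Assumes a non-interleaved layout: element i pairs with element i + half_dim.
void RopeRemainderScalar(const float* input, const float* sin_data,
                         const float* cos_data, float* output,
                         std::size_t start, std::size_t half_dim) {
    assert(start <= half_dim);
    for (std::size_t i = start; i < half_dim; ++i) {
        // Each iteration reads exactly the floats it needs, so no access
        // can land beyond the end of input, sin_data, or cos_data.
        const float real = input[i];
        const float imag = input[i + half_dim];
        output[i] = real * cos_data[i] - imag * sin_data[i];
        output[i + half_dim] = real * sin_data[i] + imag * cos_data[i];
    }
}
```

A caller would run the 8-wide AVX loop while at least 8 elements remain, then pass the index where that loop stopped as start.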
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>