
Conversation

@NoahOksuz (Contributor)

Summary

I was really bored in some lectures last week, so I scoured the repo for optimisable/improvable parts. This PR accelerates multiple hot paths in ggml-cpu via multi‑ISA SIMD, better threading/cache locality, and tighter inner loops. It touches vector activations, quantization, normalization kernels, the KV cache, and repack paths.

  • Vector ops: SIMD hardswish/hardsigmoid; improved trailing‑element handling (a sketch of this pattern follows below)
  • Matmul/quant: parallelized A‑quant for cache locality and core utilization
  • Norms: SIMD reductions in RMSNorm (fwd/bwd), GroupNorm, L2 norm
  • KV cache: reordered conditions, hoisted invariants, simplified mask generation
  • Repack: SIMD absolute‑max for generic quant flows (Q8_0 4x4/4x8, Q8_K 4x8)

Architectures: AVX512/AVX2/SSE2 (x86), NEON/SVE (ARM), RVV (RISC‑V), with scalar fallbacks.
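
To make the vector‑ops item concrete, here is a minimal sketch of an activation kernel in that style: an AVX2 hardsigmoid with a clean scalar tail. The function name and exact clamp form are illustrative only, not the PR's actual vec.cpp code (which also covers AVX512, NEON/SVE and RVV).

#include <immintrin.h>
#include <math.h>

// Illustrative only: hardsigmoid(x) = clamp((x + 3) / 6, 0, 1),
// processed 8 floats at a time with a scalar tail for the remainder.
static void hardsigmoid_f32_avx2(const float * x, float * y, int n) {
    int i = 0;
#if defined(__AVX2__)
    const __m256 three = _mm256_set1_ps(3.0f);
    const __m256 sixth = _mm256_set1_ps(1.0f / 6.0f);
    const __m256 zero  = _mm256_setzero_ps();
    const __m256 one   = _mm256_set1_ps(1.0f);
    for (; i + 7 < n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        v = _mm256_mul_ps(_mm256_add_ps(v, three), sixth);
        v = _mm256_min_ps(_mm256_max_ps(v, zero), one);   // clamp to [0, 1]
        _mm256_storeu_ps(y + i, v);
    }
#endif
    // scalar tail (also the fallback when AVX2 is unavailable)
    for (; i < n; ++i) {
        y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f));
    }
}

Hardswish follows the same pattern with an extra multiply by x at the end.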

Changes by area

  • vec (ggml/src/ggml-cpu/vec.cpp, vec.h)
    • Added SIMD implementations for hardswish/hardsigmoid across ISAs
    • Reduced overhead for tails (clean scalar tails or single‑width fallbacks)
  • mmq (mmq.cpp)
    • Parallelized quantization of matrix A; chunked to preserve locality and reduce contention (a sketch follows after this list)
  • ops (ggml/src/ggml-cpu/ops.cpp)
    • RMSNorm forward: SIMD sum‑of‑squares
    • RMSNorm backward: SIMD for sum‑of‑squares + dot
    • L2 norm: SIMD reduction
    • GroupNorm: SIMD sum and sum‑of‑squares
  • KV cache (llama-kv-cache.cpp)
    • Condition reordering for better branch prediction
    • Hoisted frequently accessed values outside inner loops
    • Simplified mask generation logic
  • repack (ggml/src/ggml-cpu/repack.cpp)
    • SIMD absolute‑max in generic quant functions for Q8 paths
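
As referenced in the mmq item above, the parallelized A‑quantization boils down to giving each thread a contiguous chunk of rows to quantize, so writes stay local and threads don't contend on shared output. This is a hedged sketch under assumed names (quantize_row_q8, ith/nth, row_q_size are placeholders, not the actual ggml-cpu symbols):

#include <stddef.h>
#include <stdint.h>

// Sketch only: split the rows of A into contiguous per-thread chunks so each
// thread quantizes (and writes) its own block of the quantized buffer A_q.
static void quantize_mat_A_parallel(const float * A, void * A_q,
                                    int64_t nrows, int64_t ncols, size_t row_q_size,
                                    int ith, int nth,
                                    void (*quantize_row_q8)(const float *, void *, int64_t)) {
    // contiguous chunk of rows per thread: thread ith handles [ir0, ir1)
    const int64_t rows_per_thread = (nrows + nth - 1) / nth;
    const int64_t ir0 = (int64_t) ith * rows_per_thread;
    const int64_t ir1 = ir0 + rows_per_thread < nrows ? ir0 + rows_per_thread : nrows;

    for (int64_t ir = ir0; ir < ir1; ++ir) {
        // each thread quantizes and writes only its own rows
        quantize_row_q8(A + ir * ncols, (char *) A_q + (size_t) ir * row_q_size, ncols);
    }
    // a thread barrier would follow before the matmul consumes A_q
}

Contiguous per-thread blocks keep each thread's working set in its own cache, at the cost of slight load imbalance when nrows is not a multiple of nth.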

Performance (CPU backend)

An A/B comparison against the prior commit (53d7d21) shows:

  • ADD_ID: up to ~3.8x (shape‑dependent), commonly 1.3–2.0x
  • MUL_MAT / MUL_MAT_ID (quantized paths): many cases 1.2–3.0x; f16/f32 often +5–30%
  • FLASH_ATTN_EXT: frequent 1.2–1.7x gains; a few small‑shape regressions
  • PAD_REFLECT_1D: ~2–6x
  • CPY / SOFT_MAX / CONV2D: mixed; many +5–30%, some regressions (−10–40%) on specific shapes

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 8, 2025
NoahOksuz and others added 2 commits November 8, 2025 22:42
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC (Collaborator) commented Nov 8, 2025

I suggest using Add suggestion to batch and commit them all at once (from Files changed).

if (gsmpl) {
    // Print grammar sampler performance if available
    if (gsmpl->grmr != nullptr) {
        llama_perf_sampler_print(gsmpl->grmr);
@NoahOksuz (Contributor, Author):

Passed my local tests. I'll look into it.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Comment on lines +141 to +167
#if defined(__AVX512F__) && defined(__AVX512DQ__)
    __m512 vamax = _mm512_setzero_ps();
    for (; j + 15 < QK8_0; j += 16) {
        __m512 vx = _mm512_loadu_ps(src_row + j);
        // andnot with -0.0f clears the sign bit, i.e. |x|
        vamax = _mm512_max_ps(vamax, _mm512_andnot_ps(_mm512_set1_ps(-0.0f), vx));
    }
    amax = _mm512_reduce_max_ps(vamax);
#elif defined(__AVX2__) && defined(__FMA__)
    __m256 vamax = _mm256_setzero_ps();
    for (; j + 7 < QK8_0; j += 8) {
        __m256 vx = _mm256_loadu_ps(src_row + j);
        vamax = _mm256_max_ps(vamax, _mm256_andnot_ps(_mm256_set1_ps(-0.0f), vx));
    }
    // horizontal max: 256 -> 128 -> scalar
    __m128 vamax128 = _mm_max_ps(_mm256_extractf128_ps(vamax, 1), _mm256_castps256_ps128(vamax));
    vamax128 = _mm_max_ps(vamax128, _mm_movehl_ps(vamax128, vamax128));
    vamax128 = _mm_max_ss(vamax128, _mm_movehdup_ps(vamax128));
    amax = _mm_cvtss_f32(vamax128);
#elif defined(__SSE2__)
    __m128 vamax = _mm_setzero_ps();
    for (; j + 3 < QK8_0; j += 4) {
        __m128 vx = _mm_loadu_ps(src_row + j);
        vamax = _mm_max_ps(vamax, _mm_andnot_ps(_mm_set1_ps(-0.0f), vx));
    }
    vamax = _mm_max_ps(vamax, _mm_movehl_ps(vamax, vamax));
    vamax = _mm_max_ss(vamax, _mm_movehdup_ps(vamax)); // note: _mm_movehdup_ps is SSE3, not SSE2
    amax = _mm_cvtss_f32(vamax);
#endif
Collaborator:

I may be wrong but shouldn't this be in ggml/src/ggml-cpu/arch/x86/repack.cpp instead?

void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k) {

}
#endif
// Scalar fallback for remaining elements
for (; i00 < ne00; i00++) {
@Djip007 (Contributor) commented Nov 10, 2025:

It is best to do the reduction only at the end. But a modern compiler can do the same job with a little help; try this on https://godbolt.org/ with, for example, GCC and "-march=znver4 -O3 -fopenmp":

#define VECT_SIZE 16

float rmse(float* v, int N) {
    float res_v[VECT_SIZE] = {0};
    int i = 0;
    for (; i<N/VECT_SIZE; ++i) {
#       pragma omp simd
        for (int k=0; k<VECT_SIZE; ++k) {
            res_v[k] += v[i*VECT_SIZE+k]*v[i*VECT_SIZE+k];
        }
    }
    // reduction
    float res = 0;
    for (int k=0; k<VECT_SIZE; ++k) {
        res += res_v[k];
    }
    i *= VECT_SIZE;
    for (; i<N; ++i) {
        res += v[i]*v[i];
    }

    return res;
}

For me, intrinsics are in most cases only needed when you can't get the same thing from plain C, like:

__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)

or even simpler:

#define VECT_SIZE 16

float rmse(float* v, int N) {
    float res = 0;
#   pragma omp simd simdlen(VECT_SIZE) reduction(+:res)
    for (int i = 0; i<N; ++i) {
        res += v[i]*v[i];
    }
    return res;
}
