
Conversation

@NoahOksuz (Contributor)

Summary

I was really bored in some lectures last week, so I scoured the repo for optimisable/improvable parts. This PR accelerates multiple hot paths in ggml-cpu via multi‑ISA SIMD, better threading/cache locality, and tighter inner loops. It touches vector activations, quantization, normalization kernels, the KV cache, and repack paths.

  • Vector ops: SIMD hardswish/hardsigmoid; improved trailing‑element handling (a sketch of this pattern follows below)
  • Matmul/quant: parallelized A‑quant for cache locality and core utilization
  • Norms: SIMD reductions in RMSNorm (fwd/bwd), GroupNorm, L2 norm
  • KV cache: reordered conditions, hoisted invariants, simplified mask generation
  • Repack: SIMD absolute‑max for generic quant flows (Q8_0 4x4/4x8, Q8_K 4x8)

Architectures: AVX512/AVX2/SSE2 (x86), NEON/SVE (ARM), RVV (RISC‑V), with scalar fallbacks.
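
To make the vector‑ops item concrete, here is a minimal sketch of an activation kernel in that style: an AVX2 hardsigmoid with a clean scalar tail. The function name and exact clamp form are illustrative only, not the PR's actual vec.cpp code (which also covers AVX512, NEON/SVE and RVV).

#include <immintrin.h>
#include <math.h>

// Illustrative only: hardsigmoid(x) = clamp((x + 3) / 6, 0, 1),
// processed 8 floats at a time with a scalar tail for the remainder.
static void hardsigmoid_f32_avx2(const float * x, float * y, int n) {
    int i = 0;
#if defined(__AVX2__)
    const __m256 three = _mm256_set1_ps(3.0f);
    const __m256 sixth = _mm256_set1_ps(1.0f / 6.0f);
    const __m256 zero  = _mm256_setzero_ps();
    const __m256 one   = _mm256_set1_ps(1.0f);
    for (; i + 7 < n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        v = _mm256_mul_ps(_mm256_add_ps(v, three), sixth);
        v = _mm256_min_ps(_mm256_max_ps(v, zero), one);   // clamp to [0, 1]
        _mm256_storeu_ps(y + i, v);
    }
#endif
    // scalar tail (also the fallback when AVX2 is unavailable)
    for (; i < n; ++i) {
        y[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f));
    }
}

Hardswish follows the same pattern with an extra multiply by x at the end.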

Changes by area

  • vec (ggml/src/ggml-cpu/vec.cpp, vec.h)
    • Added SIMD implementations for hardswish/hardsigmoid across ISAs
    • Reduced overhead for tails (clean scalar tails or single‑width fallbacks)
  • mmq (mmq.cpp)
    • Parallelized quantization of matrix A; chunked to preserve locality and reduce contention (a sketch follows after this list)
  • ops (ggml/src/ggml-cpu/ops.cpp)
    • RMSNorm forward: SIMD sum‑of‑squares
    • RMSNorm backward: SIMD for sum‑of‑squares + dot
    • L2 norm: SIMD reduction
    • GroupNorm: SIMD sum and sum‑of‑squares
  • KV cache (llama-kv-cache.cpp)
    • Condition reordering for better branch prediction
    • Hoisted frequently accessed values outside inner loops
    • Simplified mask generation logic
  • repack (ggml/src/ggml-cpu/repack.cpp)
    • SIMD absolute‑max in generic quant functions for Q8 paths
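
As referenced in the mmq item above, the parallelized A‑quantization boils down to giving each thread a contiguous chunk of rows to quantize, so writes stay local and threads don't contend on shared output. This is a hedged sketch under assumed names (quantize_row_q8, ith/nth, row_q_size are placeholders, not the actual ggml-cpu symbols):

#include <stddef.h>
#include <stdint.h>

// Sketch only: split the rows of A into contiguous per-thread chunks so each
// thread quantizes (and writes) its own block of the quantized buffer A_q.
static void quantize_mat_A_parallel(const float * A, void * A_q,
                                    int64_t nrows, int64_t ncols, size_t row_q_size,
                                    int ith, int nth,
                                    void (*quantize_row_q8)(const float *, void *, int64_t)) {
    // contiguous chunk of rows per thread: thread ith handles [ir0, ir1)
    const int64_t rows_per_thread = (nrows + nth - 1) / nth;
    const int64_t ir0 = (int64_t) ith * rows_per_thread;
    const int64_t ir1 = ir0 + rows_per_thread < nrows ? ir0 + rows_per_thread : nrows;

    for (int64_t ir = ir0; ir < ir1; ++ir) {
        // each thread quantizes and writes only its own rows
        quantize_row_q8(A + ir * ncols, (char *) A_q + (size_t) ir * row_q_size, ncols);
    }
    // a thread barrier would follow before the matmul consumes A_q
}

Contiguous per-thread blocks keep each thread's working set in its own cache, at the cost of slight load imbalance when nrows is not a multiple of nth.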

Performance (CPU backend)

An A/B comparison against the prior commit (53d7d21) shows:

  • ADD_ID: up to ~3.8x (shape‑dependent), commonly 1.3–2.0x
  • MUL_MAT / MUL_MAT_ID (quantized paths): many cases 1.2–3.0x; f16/f32 often +5–30%
  • FLASH_ATTN_EXT: frequent 1.2–1.7x gains; a few small‑shape regressions
  • PAD_REFLECT_1D: ~2–6x
  • CPY / SOFT_MAX / CONV2D: mixed; many +5–30%, some regressions (−10–40%) on specific shapes

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 8, 2025
NoahOksuz and others added 2 commits November 8, 2025 22:42
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@CISC (Collaborator) commented Nov 8, 2025

I suggest using Add suggestion to batch and commit them all at once (from Files changed).

if (gsmpl) {
    // Print grammar sampler performance if available
    if (gsmpl->grmr != nullptr) {
        llama_perf_sampler_print(gsmpl->grmr);
@NoahOksuz (Contributor, Author):

Passed my local tests. I'll look into it.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Comment on lines +141 to +167
#if defined(__AVX512F__) && defined(__AVX512DQ__)
    __m512 vamax = _mm512_setzero_ps();
    for (; j + 15 < QK8_0; j += 16) {
        __m512 vx = _mm512_loadu_ps(src_row + j);
        // andnot with -0.0f clears the sign bit, i.e. |x|
        vamax = _mm512_max_ps(vamax, _mm512_andnot_ps(_mm512_set1_ps(-0.0f), vx));
    }
    amax = _mm512_reduce_max_ps(vamax);
#elif defined(__AVX2__) && defined(__FMA__)
    __m256 vamax = _mm256_setzero_ps();
    for (; j + 7 < QK8_0; j += 8) {
        __m256 vx = _mm256_loadu_ps(src_row + j);
        vamax = _mm256_max_ps(vamax, _mm256_andnot_ps(_mm256_set1_ps(-0.0f), vx));
    }
    // horizontal max: 256 -> 128 -> scalar
    __m128 vamax128 = _mm_max_ps(_mm256_extractf128_ps(vamax, 1), _mm256_castps256_ps128(vamax));
    vamax128 = _mm_max_ps(vamax128, _mm_movehl_ps(vamax128, vamax128));
    vamax128 = _mm_max_ss(vamax128, _mm_movehdup_ps(vamax128));
    amax = _mm_cvtss_f32(vamax128);
#elif defined(__SSE2__)
    __m128 vamax = _mm_setzero_ps();
    for (; j + 3 < QK8_0; j += 4) {
        __m128 vx = _mm_loadu_ps(src_row + j);
        vamax = _mm_max_ps(vamax, _mm_andnot_ps(_mm_set1_ps(-0.0f), vx));
    }
    vamax = _mm_max_ps(vamax, _mm_movehl_ps(vamax, vamax));
    vamax = _mm_max_ss(vamax, _mm_movehdup_ps(vamax)); // note: _mm_movehdup_ps is SSE3, not SSE2
    amax = _mm_cvtss_f32(vamax);
#endif
Collaborator:

I may be wrong but shouldn't this be in ggml/src/ggml-cpu/arch/x86/repack.cpp instead?

void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k) {

}
#endif
// Scalar fallback for remaining elements
for (; i00 < ne00; i00++) {
@Djip007 (Contributor) commented Nov 10, 2025:

It is best to do the reduction only at the end. But a modern compiler can do the same job with a little help; try this on https://godbolt.org/ with, for example, GCC and "-march=znver4 -O3 -fopenmp":

#define VECT_SIZE 16

float rmse(float* v, int N) {
    float res_v[VECT_SIZE] = {0};
    int i = 0;
    for (; i<N/VECT_SIZE; ++i) {
#       pragma omp simd
        for (int k=0; k<VECT_SIZE; ++k) {
            res_v[k] += v[i*VECT_SIZE+k]*v[i*VECT_SIZE+k];
        }
    }
    // reduction
    float res = 0;
    for (int k=0; k<VECT_SIZE; ++k) {
        res += res_v[k];
    }
    i *= VECT_SIZE;
    for (; i<N; ++i) {
        res += v[i]*v[i];
    }

    return res;
}

For me, intrinsics are in most cases only needed when you can't get the same thing from plain C, like:

__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)

or even simpler:

#define VECT_SIZE 16

float rmse(float* v, int N) {
    float res = 0;
#   pragma omp simd simdlen(VECT_SIZE) reduction(+:res)
    for (int i = 0; i<N; ++i) {
        res += v[i]*v[i];
    }
    return res;
}
