CPU SIMD and pipeline optimizations across vec/mmq/ops/kv-cache/repack #17113
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
I suggest using:
if (gsmpl) {
    // Print grammar sampler performance if available
    if (gsmpl->grmr != nullptr) {
        llama_perf_sampler_print(gsmpl->grmr);
    }
}
Passed my local tests. I'll look into it.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
#if defined(__AVX512F__) && defined(__AVX512DQ__)
    __m512 vamax = _mm512_setzero_ps();
    for (; j + 15 < QK8_0; j += 16) {
        __m512 vx = _mm512_loadu_ps(src_row + j);
        vamax = _mm512_max_ps(vamax, _mm512_andnot_ps(_mm512_set1_ps(-0.0f), vx));
    }
    amax = _mm512_reduce_max_ps(vamax);
#elif defined(__AVX2__) && defined(__FMA__)
    __m256 vamax = _mm256_setzero_ps();
    for (; j + 7 < QK8_0; j += 8) {
        __m256 vx = _mm256_loadu_ps(src_row + j);
        vamax = _mm256_max_ps(vamax, _mm256_andnot_ps(_mm256_set1_ps(-0.0f), vx));
    }
    __m128 vamax128 = _mm_max_ps(_mm256_extractf128_ps(vamax, 1), _mm256_castps256_ps128(vamax));
    vamax128 = _mm_max_ps(vamax128, _mm_movehl_ps(vamax128, vamax128));
    vamax128 = _mm_max_ss(vamax128, _mm_movehdup_ps(vamax128));
    amax = _mm_cvtss_f32(vamax128);
#elif defined(__SSE2__)
    __m128 vamax = _mm_setzero_ps();
    for (; j + 3 < QK8_0; j += 4) {
        __m128 vx = _mm_loadu_ps(src_row + j);
        vamax = _mm_max_ps(vamax, _mm_andnot_ps(_mm_set1_ps(-0.0f), vx));
    }
    vamax = _mm_max_ps(vamax, _mm_movehl_ps(vamax, vamax));
    // note: _mm_movehdup_ps is SSE3, so use an SSE2-safe shuffle here
    vamax = _mm_max_ss(vamax, _mm_shuffle_ps(vamax, vamax, _MM_SHUFFLE(1, 1, 1, 1)));
    amax = _mm_cvtss_f32(vamax);
#endif
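For context, the reduced amax is the per-block absolute maximum from which the Q8_0 scale is derived. The follow-on step, sketched below, is modeled on ggml's reference Q8_0 quantizer; the y[i].d / y[i].qs destination fields and GGML_FP32_TO_FP16 are assumed from that reference code, not taken from this diff:

// Sketch (assumed names): derive the Q8_0 scale from amax, then quantize.
const float d  = amax / ((1 << 7) - 1);   // largest |x| maps to +/-127
const float id = d ? 1.0f / d : 0.0f;     // guard the all-zero block
y[i].d = GGML_FP32_TO_FP16(d);
for (int j = 0; j < QK8_0; ++j) {
    y[i].qs[j] = (int8_t) roundf(src_row[j] * id);
}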
I may be wrong, but shouldn't this be in ggml/src/ggml-cpu/arch/x86/repack.cpp instead?
void ggml_quantize_mat_q8_0_4x8(const float * GGML_RESTRICT x, void * GGML_RESTRICT vy, int64_t k) {
    }
#endif
    // Scalar fallback for remaining elements
    for (; i00 < ne00; i00++) {
It is best to reduce only at the end. But a modern compiler can do the same job with a little help. Try this on https://godbolt.org/ with, for example, GCC and "-march=znver4 -O3 -fopenmp":

#define VECT_SIZE 16
// sum of squares of v[0..N) (name kept from the original comment)
float rmse(float* v, int N) {
    float res_v[VECT_SIZE] = {0};
    int i = 0;
    for (; i < N/VECT_SIZE; ++i) {
#       pragma omp simd
        for (int k = 0; k < VECT_SIZE; ++k) {
            res_v[k] += v[i*VECT_SIZE+k]*v[i*VECT_SIZE+k];
        }
    }
    // reduction
    float res = 0;
    for (int k = 0; k < VECT_SIZE; ++k) {
        res += res_v[k];
    }
    // scalar tail for the remaining elements
    i *= VECT_SIZE;
    for (; i < N; ++i) {
        res += v[i]*v[i];
    }
    return res;
}

For me, in most cases intrinsics are only needed when there is no way to express the same thing in C, like:

__m512 _mm512_dpbf16_ps (__m512 src, __m512bh a, __m512bh b)

or even simpler:

#define VECT_SIZE 16
float rmse(float* v, int N) {
    float res = 0;
#   pragma omp simd simdlen(VECT_SIZE) reduction(+:res)
    for (int i = 0; i < N; ++i) {
        res += v[i]*v[i];
    }
    return res;
}
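The same approach extends to the abs-max reduction in the diff above. A minimal sketch, not the PR's code, assuming OpenMP 4.0+ for the simd construct with a max reduction:

#include <math.h>

// Portable alternative to the intrinsics version above: let the
// compiler vectorize an abs-max reduction (hypothetical helper).
float block_amax(const float * src_row, int n) {
    float amax = 0.0f;
#   pragma omp simd reduction(max:amax)
    for (int j = 0; j < n; ++j) {
        amax = fmaxf(amax, fabsf(src_row[j]));
    }
    return amax;
}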
Summary
I was really bored in some lectures last week, so I scoured the repo for optimisable/improvable parts. This PR accelerates multiple hot paths in ggml-cpu via multi-ISA SIMD, better threading/cache locality, and tighter inner loops. It touches vector activations, quantization, normalization kernels, the KV cache, and repack paths.
Architectures: AVX512/AVX2/SSE2 (x86), NEON/SVE (ARM), RVV (RISC-V), with scalar fallbacks.
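These kernels follow the compile-time dispatch shape visible in the diffs above: widest available ISA first, narrower ones next, scalar tail last. A minimal sketch with a hypothetical elementwise kernel (vec_sqr_f32 is illustrative, not code from the PR):

#include <stdint.h>
#if defined(__AVX512F__) || defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical kernel illustrating the per-ISA guard pattern.
static void vec_sqr_f32(float * dst, const float * src, int64_t n) {
    int64_t i = 0;
#if defined(__AVX512F__)
    for (; i + 15 < n; i += 16) {
        __m512 v = _mm512_loadu_ps(src + i);
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(v, v));
    }
#elif defined(__AVX2__)
    for (; i + 7 < n; i += 8) {
        __m256 v = _mm256_loadu_ps(src + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, v));
    }
#endif
    // scalar fallback; also handles the tail after a vector loop
    for (; i < n; ++i) {
        dst[i] = src[i] * src[i];
    }
}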
Changes by area
- Vector activations (ggml/src/ggml-cpu/vec.cpp, vec.h)
- Quantized matmul (mmq.cpp)
- Normalization kernels (ggml/src/ggml-cpu/ops.cpp)
- KV cache (llama-kv-cache.cpp)
- Repack paths (ggml/src/ggml-cpu/repack.cpp)

Performance (CPU backend)
An A/B run against the prior commit (53d7d21) shows: