Conversation

@max-krasnyansky
Collaborator

When I was testing matmul chunking I noticed that we're still issuing barriers for NOPs.
This PR adds an explicit check for NOPs in `graph_compute_thread` so that we can skip those barriers.
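
Roughly, the change looks like this. This is a sketch of the idea rather than the literal diff: `node_is_nop` is a stand-in name and the real thread loop carries more state, but the op names are from ggml's op enum and `ggml_compute_forward`/`ggml_barrier` are the CPU backend's compute and sync points:

```c
// Sketch: metadata-only ops (reshape/view/permute/transpose) do no
// per-thread work, so there is nothing to compute and nothing to
// synchronize on.
static inline bool node_is_nop(const struct ggml_tensor * node) {
    switch (node->op) {
        case GGML_OP_NONE:
        case GGML_OP_RESHAPE:
        case GGML_OP_VIEW:
        case GGML_OP_PERMUTE:
        case GGML_OP_TRANSPOSE:
            return true;
        default:
            return false;
    }
}

// ... inside graph_compute_thread's per-node loop:
for (int node_n = 0; node_n < cgraph->n_nodes; node_n++) {
    struct ggml_tensor * node = cgraph->nodes[node_n];

    if (node_is_nop(node)) {
        continue; // no work was dispatched, so no barrier is needed
    }

    ggml_compute_forward(&params, node);
    ggml_barrier(tp); // sync threads only after ops that did real work
}
```

Skipping is safe because every thread classifies the node identically, so they all skip the same nodes and stay in lockstep on the remaining barriers.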

I instrumented the code to count how many barriers we actually skip for various models, and it's quite a few. The numbers below are per graph (i.e. per token, etc.):

```
Llama-3.2-1B SKIPPED 192
Llama-3.2-3B SKIPPED 336

Qwen3-0.6B   SKIPPED 336
Qwen3-4B     SKIPPED 432
Qwen3-VL-2B  SKIPPED 336

GPT-OSS-20B  SKIPPED 504
```

The overall speed-up is noticeable for smaller models, and in general it makes sense to avoid wasting cycles.
Here is Qwen3-0.6B on the M4 Pro CPU, before and after:

```
./scripts/compare-commits.sh master cpu-skip-nops llama-bench --device none -m ../gguf/Qwen3-0.6B-Q4_0.gguf -p 128 -n 64 -t 8 -fa 1
...
```
| Model           | Test   |   t/s master |   t/s cpu-skip-nops |   Speedup |
|:----------------|:-------|-------------:|--------------------:|----------:|
| qwen3 0.6B Q4_0 | pp128  |      1889.64 |             1907.99 |      1.01 |
| qwen3 0.6B Q4_0 | tg64   |       315.90 |              331.45 |      1.05 |

I'm seeing a similar bump on the Snapdragons: not so much for prompt processing, but definitely for token generation.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 10, 2025

max-krasnyansky merged commit 395e286 into ggml-org:master on Nov 10, 2025 (64 of 66 checks passed)