Conversation

@ggerganov
Member

@ggerganov ggerganov commented Nov 9, 2025

fix #16657
ref #16276 (review)

This fixes RPC inference when the Metal backend is involved.

Testing:

# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls? (see the sketch after this list)
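
For illustration, one way such a cache could look on the client side (a rough sketch; the struct and field names are hypothetical, not the actual ggml RPC code):

// Hypothetical client-side cache for RPC_CMD_GET_ALLOC_SIZE responses.
// Key the cached size on the tensor properties that determine it, and skip
// the network round-trip when an identical tensor has already been queried.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <unordered_map>

struct alloc_size_key {
    int32_t  type;   // ggml tensor type
    uint64_t ne[4];  // number of elements per dimension
    uint64_t nb[4];  // strides in bytes

    bool operator==(const alloc_size_key & other) const {
        return type == other.type &&
               std::memcmp(ne, other.ne, sizeof(ne)) == 0 &&
               std::memcmp(nb, other.nb, sizeof(nb)) == 0;
    }
};

struct alloc_size_key_hash {
    std::size_t operator()(const alloc_size_key & k) const {
        std::size_t h = std::hash<int32_t>()(k.type);
        for (int i = 0; i < 4; ++i) {
            h ^= std::hash<uint64_t>()(k.ne[i]) + 0x9e3779b9 + (h << 6) + (h >> 2);
            h ^= std::hash<uint64_t>()(k.nb[i]) + 0x9e3779b9 + (h << 6) + (h >> 2);
        }
        return h;
    }
};

// consulted before issuing the RPC call; filled in after a cache miss
static std::unordered_map<alloc_size_key, uint64_t, alloc_size_key_hash> alloc_size_cache;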

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels Nov 9, 2025
@ggerganov
Member Author

@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects performance in your RPC use cases?

Collaborator

@rgerganov rgerganov left a comment

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

@@ -8,4 +7,4 @@
#endif

#define RPC_PROTO_MAJOR_VERSION 3
Collaborator

let's bump the protocol version as this is a breaking change
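
For illustration, the bump would amount to a one-line change along these lines (the exact new value is an assumption):

#define RPC_PROTO_MAJOR_VERSION 4   // hypothetical: next major version after this breaking change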

@ggerganov
Member Author

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...

@rgerganov
Collaborator

Looking at the debug logs on the server side, I believe this change will only increase TTFT and will not affect PP and TG speeds. Do I get this right?

Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls?

LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ...

Yes, disabling graph reuse results in a lot of RPC_CMD_GET_ALLOC_SIZE calls for each graph computation.

@ggerganov
Member Author

Great, I didn't realize until now that the graph reuse saves us almost all of the extra RPC calls. That probably makes caching the RPC calls mostly redundant, since I don't think this will have a significant impact on PP.

@ggerganov ggerganov marked this pull request as ready for review November 10, 2025 19:34
@ggerganov ggerganov requested a review from slaren as a code owner November 10, 2025 19:34
@ggerganov
Member Author

Some benchmarks would be useful to confirm that the performance is not significantly affected. With that, I think this should be good to merge.

@jukofyork
Collaborator

@jukofyork I think you were doing some experiments with RPC recently - could you check if this change significantly affects performance in your RPC use cases?

No problem, but it will likely be Thursday before I can run any tests.

@slaren
Member

slaren commented Nov 11, 2025

I think eventually the proper solution to this and other per-tensor calls such as init_tensor will be to batch all the tensors into a single call. For example, ggml-alloc could be modified to build a list of tensors and obtain all the tensor alloc sizes first, then use that data rather than calling get_alloc_size every time. This way, instead of thousands of calls (each with a network round trip), it would only require one. The RPC backend could also cache the results so that calls with identical tensors don't require the network at all.
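
For illustration, a rough sketch of what a batched request/response could look like (message and field names are hypothetical, not the actual ggml RPC protocol):

#include <cstdint>
#include <vector>

// minimal per-tensor descriptor needed to compute an allocation size
struct rpc_tensor_desc {
    int32_t  type;   // ggml tensor type
    uint64_t ne[4];  // number of elements per dimension
    uint64_t nb[4];  // strides in bytes
};

// hypothetical RPC_CMD_GET_ALLOC_SIZES request: all tensor descriptors in one message
struct rpc_msg_get_alloc_sizes_req {
    std::vector<rpc_tensor_desc> tensors;
};

// response: one allocation size per requested tensor, in the same order
struct rpc_msg_get_alloc_sizes_rsp {
    std::vector<uint64_t> alloc_sizes;
};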


Successfully merging this pull request may close these issues.

Eval bug: Incorrect outputs when running inference in multiple nodes (Mac)
