-
Notifications
You must be signed in to change notification settings - Fork 13.7k
rpc : fix alloc size logic #17116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
rpc : fix alloc size logic #17116
Conversation
|
@jukofyork I think you were doing some experiments with RPC recently - could you check if this change affects significantly the performance in your RPC use cases? |
rgerganov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When looking the debug logs at the server side, I believe this change will increase only TTFT and will not affect PP and TG speeds. Do I get this right?
| @@ -8,4 +7,4 @@ | |||
| #endif | |||
|
|
|||
| #define RPC_PROTO_MAJOR_VERSION 3 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's bump the protocol version as this is a breaking change
Hm, maybe that's correct. Likely because of the graph reuse logic. If you disable the graph reuse, do you see more RPC calls? LLAMA_GRAPH_REUSE_DISABLE=1 llama-cli ... |
Yes, disabling graph reuse results into a lot of |
|
Great, I didn't realize until now that the graph reuse saves us almost all the extra RPC calls. This maybe makes the need to cache the RPC calls almost redundant as I think this won't have a significant impact on the PP. |
|
Some benches could be useful to confirm that the performance is not significantly affected. And I think it should be good to merge. |
No problem, but it will likely be Thursday before I can run any tests. |
|
I think eventually the proper solution to this and other per-tensor calls such as |
fix #16657
ref #16276 (review)
This fixes the RPC inference when Metal backend is involved.
Testing:
TODO: