-
-
Notifications
You must be signed in to change notification settings - Fork 107
feat: harden subscription routing #2064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@claude Please review PR |
|
Claude finished @sanity's task —— View job PR Review CompleteTodo List
SummaryThis PR significantly enhances subscription routing reliability by adding contract code fetching, improved candidate peer logging, and router distance defaults adjustments. The changes are generally well-structured and improve the robustness of subscription operations. Key Improvements
Detailed Findings1. Subscription Routing Logic (
|
|
Took another pass over Claude’s feedback:
On the potential recursion: when a seek request can’t find any candidates we attempt a single GET via Testing:
|
| .filter_map(|peer| { | ||
| peer.location | ||
| .map(|loc| (peer, target_location.distance(loc))) | ||
| .map(|peer| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a dejavuh with this, already commented in a previous PR, let's not make upa. distance, filter_map will filter on collect peers which dont have a distance setup anyway, which in any case is not gonna happen in practice, unless is some sort of transient connection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Codex] Thanks for flagging. I dropped the synthetic distance helper and restored the original filter_map behaviour so we only consider peers with a recorded location—matching the prior implementation and avoiding any made-up distances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we trying to route an update while the node has not joined the network? that sounds wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iduartgomez You mean if the node is not part of the subscription tree for that contract? The idea was that the update would be routed towards the contract in the hope that it would hit a subscribed node. Is that incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During bootstrap we still pass PeerKeyLocations without a real location (e.g. from connection_manager.own_location() before the node has joined), so dropping them leaves us with no routing candidates.
If the node has no real location is because it hasn't joined yet no? There is no other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think is semantically correct to try execute operations while technically the peer hasn't even join the network successfully. Why not return a proper error in that case and let the caller handle it properly (with retries, waits, etc.).
We don't really need to be smart here, or do we? Is better to make that behavior EXPLICIT and fail fast. IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Codex] @iduartgomez Makes sense. Today the stack still runs PUT/SUBSCRIBE/etc. during that short pre-join window, so removing the fallback immediately regressed tests (routing had zero candidates). Switching to the behavior you describe means adding an explicit “peer not joined” error path through those ops and teaching callers/tests to retry once the join completes. I filed #2069 to track that cleanup so we can make the semantics explicit in a follow-up. For this PR I’d prefer to keep the fallback so the current stack stays stable, but happy to reprioritize that issue if you want to tackle the fail-fast change next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sanity I leave this at your discretion, but I am afraid this will create unexpected behavior I we keep it around for long, so if we decide to go ahead I think we should prioritize fixing the root cause in a different way and get rid of this fallback and things like peer_distance_or_default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iduartgomez How about if we fix it as soon as this PR stack gets merged? I've marked the issue high-priority. I'm just worried about how complex this PR stack already is and Codex thinks this change would touch a lot of stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
117dac9 to
f4ad769
Compare
c3fdf36 to
e161690
Compare
4103ba5 to
e1f507b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fundamentally there is a problem with conflating network subscriptions and local client application subscriptions. IMO we should distinguish them and handle them differently via different flows (e.g. sending internal event notifications for local ones)
This problem has been permeating across multiple bug fixes and changes and looks like we should address the root cause instead of keep working around it.
|
Taking an overall look at the PR I think we have successfully managed to couple local subscriptions/updates to network subscriptions/updates, which is ... NOT good. The goal long-term would be to separate them absolutely, we shouldn't have to worry about local connections inside ops/ module neither viceversa at the websocket/contract layer. This is what I have been trying to avoid a long time and now we are ending with more coupling instead of less and try to untangle that next. Let's move forward with this, I know the next PR also reinforces coupling of the transport and our logical connect network operation, although that one is harder to solve, the handshake handler was supposed to isolate peers which are logically connected from the network to those which are not, but this PR stack has nuked that. Maybe some bugs were fixed along the way idk but seems like a pyrrhic victory for long term maintainability. |
iduartgomez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to rebase on main for merge.
e1f507b to
0a1a2f5
Compare
Summary
Testing