feat: harden subscription routing #2064

sanity · 2025-11-07T17:50:26Z

Summary

ensure subscribes fetch contract code and log candidate peers
bump stdlib + freenet-ping deps and update gateway script
adjust router distance defaults and request density tracing

Testing

cargo test operations::subscribe
cargo test --test operations subscribe_path
cargo test connectivity::test_gateway_reconnection

sanity · 2025-11-07T22:51:37Z

@claude Please review PR

claude · 2025-11-07T22:51:53Z

Claude finished @sanity's task —— View job

PR Review Complete

Todo List

Read all changed files
Review main subscription routing changes (subscribe.rs)
Review router distance and tracing changes
Check for bugs, security issues, and performance problems
Provide comprehensive feedback

Summary

This PR significantly enhances subscription routing reliability by adding contract code fetching, improved candidate peer logging, and router distance defaults adjustments. The changes are generally well-structured and improve the robustness of subscription operations.

Key Improvements

Contract Code Fetching: Added fetch_contract_if_missing() to ensure contract code is available before completing subscriptions
Enhanced Logging: Extensive tracing additions for debugging subscription routing
Router Distance Handling: Fixed edge cases where peers without locations could cause issues
Test Output Cleanup: Replaced println! with debug! for cleaner test output

Detailed Findings

1. Subscription Routing Logic (`subscribe.rs`)

Strengths:

Excellent addition of fetch_contract_if_missing() to ensure contract code is present (lines 54-72)
Good use of wait_for_local_contract() with timeout/polling for async contract availability (lines 38-52)
Improved handling of local subscription scenarios with complete_local_subscription() helper (lines 258-288)
Enhanced skip list management prevents routing loops
Better HTL (hops-to-live) validation and bounds checking (lines 520-521, 536-543)

Issues & Suggestions:

🔴 High Priority - Potential Infinite Recursion
crates/core/src/operations/subscribe.rs:564-580
When no candidates are found for forwarding, the code attempts to fetch the contract locally using GET operation. However, if the GET also fails to find the contract, this could potentially lead to cascading failures. Consider:

// Current approach might retry indefinitely
let get_op = get::start_op(*key, true, false);
if let Err(fetch_err) = get::request_get(op_manager, get_op, HashSet::new()).await

Recommendation: Add a retry limit or context flag to prevent GET operations from triggering additional subscribe operations.

🟡 Medium Priority - Error Handling Inconsistency
crates/core/src/operations/subscribe.rs:771
The fetch_contract_if_missing() call uses ? operator which will propagate errors up, but there's no logging at this level:

fetch_contract_if_missing(op_manager, *key).await?;

Recommendation: Add error logging before propagation:

if let Err(e) = fetch_contract_if_missing(op_manager, *key).await {
    tracing::warn!(tx = %id, %key, error = %e, "Failed to fetch contract code");
    return Err(e);
}

🟡 Medium Priority - Magic Numbers
crates/core/src/operations/subscribe.rs:23-24
The timeout and polling interval constants lack documentation:

const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;

Recommendation: Add documentation explaining the rationale for these values:

/// Maximum time to wait for a contract to become available locally after a GET request
const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
/// Interval between checks when polling for contract availability
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;

🟢 Low Priority - Verbose Logging
crates/core/src/operations/subscribe.rs:649-679
The subscribers_snapshot() calls before/after subscriber registration create significant log volume. While useful for debugging, this might be overwhelming in production.
Recommendation: Consider using debug! instead of info! for the before/after snapshots, or add a feature flag for verbose subscription logging.

🟢 Low Priority - Code Duplication
crates/core/src/operations/subscribe.rs:405-445 and crates/core/src/operations/subscribe.rs:649-679
Similar subscriber registration logic appears multiple times with slight variations.
Recommendation: Consider extracting into a helper function like register_subscriber_with_logging() to reduce duplication and improve maintainability.

2. Router Distance Defaults (`router/mod.rs`)

Strengths:

Good fix for handling peers without location data (lines 165-169)
Using Distance::new(0.5) as default is reasonable (middle of the ring)

Issues & Suggestions:

🟡 Medium Priority - Inconsistent Handling
crates/core/src/router/mod.rs:165-169 vs crates/core/src/router/mod.rs:206-212
Two different approaches for handling missing locations:

// In select_closest_peers: unwrap_or with default distance
.map(|peer| {
    let distance = peer.location
        .map(|loc| target_location.distance(loc))
        .unwrap_or_else(|| Distance::new(0.5));
    (peer, distance)
})

// In select_k_best_peers: filter_map that excludes peers without location
.filter_map(|peer| {
    peer.location.map(|loc| {
        let distance = target_location.distance(loc);
        (peer, distance)
    })
})

Recommendation: Decide on a consistent strategy. If peers without locations should be included with default distance, apply this uniformly. If they should be excluded, apply that uniformly. The current mixed approach could lead to unexpected behavior.

3. PUT Operation Changes (`operations/put.rs`)

Strengths:

Good fix for preserving upstream reference in broadcast scenarios (lines 888-896)

Issues & Suggestions:

🟢 Low Priority - Logic Complexity
crates/core/src/operations/put.rs:888-896
The preserved_upstream logic with or_else(|| Some(upstream.clone())) is somewhat convoluted:

let upstream_for_completion = preserved_upstream
    .clone()
    .or_else(|| Some(upstream.clone()));

Recommendation: Simplify to:

let upstream_for_completion = preserved_upstream.or(Some(upstream));

(No need for clone() after or_else if you use or)

4. Test Cleanup

Strengths:

Good practice replacing println! with debug! in tests
Maintains debuggability while keeping test output clean

Testing Recommendations

The PR description mentions three test commands. Please also consider:

Stress Testing: Run subscription operations under high load with multiple concurrent subscribes
Network Partition Tests: Test behavior when contract is unavailable and network is partitioned
HTL Boundary Tests: Verify behavior at HTL=0 and HTL=1 edge cases
GET-Subscribe Interaction: Test scenarios where GET operations are triggered from within subscribe operations

Minor Observations

Dependency bumps (stdlib 0.1.14 → 0.1.22) look reasonable
Script changes are minor cleanup improvements
Overall code quality is high with good error handling

Conclusion

This is a solid PR that meaningfully improves subscription routing robustness. The main concerns are:

Potential for cascading failures in GET fallback logic
Inconsistent peer location handling in router
Some opportunities for code simplification

The extensive logging additions are valuable for debugging but consider tunability for production environments.

Recommendation: Approve with minor revisions suggested above.

sanity · 2025-11-08T02:30:05Z

Took another pass over Claude’s feedback:

Documented the local wait constants and replaced the repetitive info! snapshots with register_subscriber_with_logging, which keeps the before/after diagnostics but at debug! level and avoids duplicating the logic (crates/core/src/operations/subscribe.rs).
Added warn-level context around fetch_contract_if_missing failures so we don’t lose the underlying error when a contract can’t be hydrated (crates/core/src/operations/subscribe.rs:803).
Introduced peer_distance_or_default so both the “closest peers” and “k best peers” paths treat peers without a recorded location the same way instead of filtering them out only in one branch (crates/core/src/router/mod.rs).
Simplified the preserved-upstream hand-off in the PUT completion path using Option::or (crates/core/src/operations/put.rs:983).

On the potential recursion: when a seek request can’t find any candidates we attempt a single GET via get::start_op(*key, true, false) (crates/core/src/operations/subscribe.rs:615). That final false means the GET op never turns around and issues a subscription, so there’s no recursive subscribe→GET→subscribe loop. If the GET fails or the contract still isn’t present after wait_for_local_contract, we immediately return return_not_subbed() (crates/core/src/operations/subscribe.rs:615-707), so we never retry indefinitely.

Testing:

cargo test operations::subscribe
cargo test --test operations subscribe_path
`cargo test connectivity::test_gateway_reconnection

apps/freenet-ping/app/Cargo.toml

apps/freenet-ping/Cargo.toml

iduartgomez · 2025-11-09T12:57:42Z

crates/core/src/router/mod.rs

-            .filter_map(|peer| {
-                peer.location
-                    .map(|loc| (peer, target_location.distance(loc)))
+            .map(|peer| {


I have a dejavuh with this, already commented in a previous PR, let's not make upa. distance, filter_map will filter on collect peers which dont have a distance setup anyway, which in any case is not gonna happen in practice, unless is some sort of transient connection

[Codex] Thanks for flagging. I dropped the synthetic distance helper and restored the original filter_map behaviour so we only consider peers with a recorded location—matching the prior implementation and avoiding any made-up distances.

Why are we trying to route an update while the node has not joined the network? that sounds wrong.

@iduartgomez You mean if the node is not part of the subscription tree for that contract? The idea was that the update would be routed towards the contract in the hope that it would hit a subscribed node. Is that incorrect?

During bootstrap we still pass PeerKeyLocations without a real location (e.g. from connection_manager.own_location() before the node has joined), so dropping them leaves us with no routing candidates.

If the node has no real location is because it hasn't joined yet no? There is no other way.

I don't think is semantically correct to try execute operations while technically the peer hasn't even join the network successfully. Why not return a proper error in that case and let the caller handle it properly (with retries, waits, etc.).

We don't really need to be smart here, or do we? Is better to make that behavior EXPLICIT and fail fast. IMO.

[Codex] @iduartgomez Makes sense. Today the stack still runs PUT/SUBSCRIBE/etc. during that short pre-join window, so removing the fallback immediately regressed tests (routing had zero candidates). Switching to the behavior you describe means adding an explicit “peer not joined” error path through those ops and teaching callers/tests to retry once the join completes. I filed #2069 to track that cleanup so we can make the semantics explicit in a follow-up. For this PR I’d prefer to keep the fallback so the current stack stays stable, but happy to reprioritize that issue if you want to tackle the fail-fast change next.

@sanity I leave this at your discretion, but I am afraid this will create unexpected behavior I we keep it around for long, so if we decide to go ahead I think we should prioritize fixing the root cause in a different way and get rid of this fallback and things like peer_distance_or_default.

@iduartgomez How about if we fix it as soon as this PR stack gets merged? I've marked the issue high-priority. I'm just worried about how complex this PR stack already is and Codex thinks this change would touch a lot of stuff.

Sounds good.

iduartgomez · 2025-11-10T16:54:28Z

crates/core/src/operations/update.rs

I think fundamentally there is a problem with conflating network subscriptions and local client application subscriptions. IMO we should distinguish them and handle them differently via different flows (e.g. sending internal event notifications for local ones)

This problem has been permeating across multiple bug fixes and changes and looks like we should address the root cause instead of keep working around it.

iduartgomez · 2025-11-10T17:23:46Z

Taking an overall look at the PR I think we have successfully managed to couple local subscriptions/updates to network subscriptions/updates, which is ... NOT good. The goal long-term would be to separate them absolutely, we shouldn't have to worry about local connections inside ops/ module neither viceversa at the websocket/contract layer. This is what I have been trying to avoid a long time and now we are ending with more coupling instead of less and try to untangle that next.

Let's move forward with this, I know the next PR also reinforces coupling of the transport and our logical connect network operation, although that one is harder to solve, the handshake handler was supposed to isolate peers which are logically connected from the network to those which are not, but this PR stack has nuked that. Maybe some bugs were fixed along the way idk but seems like a pyrrhic victory for long term maintainability.

iduartgomez

Need to rebase on main for merge.

sanity requested a review from iduartgomez November 8, 2025 17:18

iduartgomez reviewed Nov 9, 2025

View reviewed changes

apps/freenet-ping/app/Cargo.toml Show resolved Hide resolved

iduartgomez reviewed Nov 9, 2025

View reviewed changes

apps/freenet-ping/Cargo.toml Show resolved Hide resolved

iduartgomez reviewed Nov 9, 2025

View reviewed changes

sanity force-pushed the stack/subscription-routing branch from 117dac9 to f4ad769 Compare November 9, 2025 16:56

iduartgomez force-pushed the stack/update-fallback branch from c3fdf36 to e161690 Compare November 9, 2025 18:36

iduartgomez force-pushed the stack/subscription-routing branch from 4103ba5 to e1f507b Compare November 9, 2025 19:20

Base automatically changed from stack/update-fallback to main November 9, 2025 19:30

sanity mentioned this pull request Nov 10, 2025

Fail fast when peer has no ring location #2069

Open

iduartgomez reviewed Nov 10, 2025

View reviewed changes

iduartgomez approved these changes Nov 10, 2025

View reviewed changes

sanity enabled auto-merge November 10, 2025 18:39

sanity added 7 commits November 10, 2025 12:45

fix: add update fallback propagation

b9d43f2

fix(update): keep local subscribers and add PUT ack tracing

f72b27c

fix(update): remove GET fallback propagation

6f7a90a

style(update): use direct equality in allow-self check

a3ce0c5

feat: harden subscription routing

6fa3d4b

fix(put): preserve upstream during broadcast

067b91f

fix(put): deliver SuccessfulPut directly to origin

0a1a2f5

sanity force-pushed the stack/subscription-routing branch from e1f507b to 0a1a2f5 Compare November 10, 2025 18:50

sanity mentioned this pull request Nov 10, 2025

Decouple local client subscriptions from network subscriptions #2075

Open

freenet deleted a comment from claude bot Nov 10, 2025

sanity added 2 commits November 10, 2025 13:59

fix(put): send direct ack to origin on final hop

75fc24e

fix(put): ensure broadcast hop acks original origin

422f5f5

Uh oh!

feat: harden subscription routing #2064

Are you sure you want to change the base?

feat: harden subscription routing #2064

Uh oh!

Conversation

sanity commented Nov 7, 2025

Summary

Testing

Uh oh!

sanity commented Nov 7, 2025

Uh oh!

claude bot commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Complete

Todo List

Summary

Key Improvements

Detailed Findings

1. Subscription Routing Logic (subscribe.rs)

2. Router Distance Defaults (router/mod.rs)

3. PUT Operation Changes (operations/put.rs)

4. Test Cleanup

Testing Recommendations

Minor Observations

Conclusion

Uh oh!

sanity commented Nov 8, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sanity Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iduartgomez commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iduartgomez left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude bot commented Nov 7, 2025 •

edited

Loading

1. Subscription Routing Logic (`subscribe.rs`)

2. Router Distance Defaults (`router/mod.rs`)

3. PUT Operation Changes (`operations/put.rs`)

sanity Nov 10, 2025 •

edited

Loading

iduartgomez commented Nov 10, 2025 •

edited

Loading