Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 7, 2025

Summary

  • ensure subscribes fetch contract code and log candidate peers
  • bump stdlib + freenet-ping deps and update gateway script
  • adjust router distance defaults and request density tracing

Testing

  • cargo test operations::subscribe
  • cargo test --test operations subscribe_path
  • cargo test connectivity::test_gateway_reconnection

@sanity
Copy link
Collaborator Author

sanity commented Nov 7, 2025

@claude Please review PR

@claude
Copy link

claude bot commented Nov 7, 2025

Claude finished @sanity's task —— View job


PR Review Complete

Todo List

  • Read all changed files
  • Review main subscription routing changes (subscribe.rs)
  • Review router distance and tracing changes
  • Check for bugs, security issues, and performance problems
  • Provide comprehensive feedback

Summary

This PR significantly enhances subscription routing reliability by adding contract code fetching, improved candidate peer logging, and router distance defaults adjustments. The changes are generally well-structured and improve the robustness of subscription operations.

Key Improvements

  1. Contract Code Fetching: Added fetch_contract_if_missing() to ensure contract code is available before completing subscriptions
  2. Enhanced Logging: Extensive tracing additions for debugging subscription routing
  3. Router Distance Handling: Fixed edge cases where peers without locations could cause issues
  4. Test Output Cleanup: Replaced println! with debug! for cleaner test output

Detailed Findings

1. Subscription Routing Logic (subscribe.rs)

Strengths:

  • Excellent addition of fetch_contract_if_missing() to ensure contract code is present (lines 54-72)
  • Good use of wait_for_local_contract() with timeout/polling for async contract availability (lines 38-52)
  • Improved handling of local subscription scenarios with complete_local_subscription() helper (lines 258-288)
  • Enhanced skip list management prevents routing loops
  • Better HTL (hops-to-live) validation and bounds checking (lines 520-521, 536-543)

Issues & Suggestions:

🔴 High Priority - Potential Infinite Recursion
crates/core/src/operations/subscribe.rs:564-580
When no candidates are found for forwarding, the code attempts to fetch the contract locally using GET operation. However, if the GET also fails to find the contract, this could potentially lead to cascading failures. Consider:

// Current approach might retry indefinitely
let get_op = get::start_op(*key, true, false);
if let Err(fetch_err) = get::request_get(op_manager, get_op, HashSet::new()).await

Recommendation: Add a retry limit or context flag to prevent GET operations from triggering additional subscribe operations.

🟡 Medium Priority - Error Handling Inconsistency
crates/core/src/operations/subscribe.rs:771
The fetch_contract_if_missing() call uses ? operator which will propagate errors up, but there's no logging at this level:

fetch_contract_if_missing(op_manager, *key).await?;

Recommendation: Add error logging before propagation:

if let Err(e) = fetch_contract_if_missing(op_manager, *key).await {
    tracing::warn!(tx = %id, %key, error = %e, "Failed to fetch contract code");
    return Err(e);
}

🟡 Medium Priority - Magic Numbers
crates/core/src/operations/subscribe.rs:23-24
The timeout and polling interval constants lack documentation:

const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;

Recommendation: Add documentation explaining the rationale for these values:

/// Maximum time to wait for a contract to become available locally after a GET request
const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
/// Interval between checks when polling for contract availability
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;

🟢 Low Priority - Verbose Logging
crates/core/src/operations/subscribe.rs:649-679
The subscribers_snapshot() calls before/after subscriber registration create significant log volume. While useful for debugging, this might be overwhelming in production.
Recommendation: Consider using debug! instead of info! for the before/after snapshots, or add a feature flag for verbose subscription logging.

🟢 Low Priority - Code Duplication
crates/core/src/operations/subscribe.rs:405-445 and crates/core/src/operations/subscribe.rs:649-679
Similar subscriber registration logic appears multiple times with slight variations.
Recommendation: Consider extracting into a helper function like register_subscriber_with_logging() to reduce duplication and improve maintainability.

2. Router Distance Defaults (router/mod.rs)

Strengths:

  • Good fix for handling peers without location data (lines 165-169)
  • Using Distance::new(0.5) as default is reasonable (middle of the ring)

Issues & Suggestions:

🟡 Medium Priority - Inconsistent Handling
crates/core/src/router/mod.rs:165-169 vs crates/core/src/router/mod.rs:206-212
Two different approaches for handling missing locations:

// In select_closest_peers: unwrap_or with default distance
.map(|peer| {
    let distance = peer.location
        .map(|loc| target_location.distance(loc))
        .unwrap_or_else(|| Distance::new(0.5));
    (peer, distance)
})

// In select_k_best_peers: filter_map that excludes peers without location
.filter_map(|peer| {
    peer.location.map(|loc| {
        let distance = target_location.distance(loc);
        (peer, distance)
    })
})

Recommendation: Decide on a consistent strategy. If peers without locations should be included with default distance, apply this uniformly. If they should be excluded, apply that uniformly. The current mixed approach could lead to unexpected behavior.

3. PUT Operation Changes (operations/put.rs)

Strengths:

  • Good fix for preserving upstream reference in broadcast scenarios (lines 888-896)

Issues & Suggestions:

🟢 Low Priority - Logic Complexity
crates/core/src/operations/put.rs:888-896
The preserved_upstream logic with or_else(|| Some(upstream.clone())) is somewhat convoluted:

let upstream_for_completion = preserved_upstream
    .clone()
    .or_else(|| Some(upstream.clone()));

Recommendation: Simplify to:

let upstream_for_completion = preserved_upstream.or(Some(upstream));

(No need for clone() after or_else if you use or)

4. Test Cleanup

Strengths:

  • Good practice replacing println! with debug! in tests
  • Maintains debuggability while keeping test output clean

Testing Recommendations

The PR description mentions three test commands. Please also consider:

  1. Stress Testing: Run subscription operations under high load with multiple concurrent subscribes
  2. Network Partition Tests: Test behavior when contract is unavailable and network is partitioned
  3. HTL Boundary Tests: Verify behavior at HTL=0 and HTL=1 edge cases
  4. GET-Subscribe Interaction: Test scenarios where GET operations are triggered from within subscribe operations

Minor Observations

  • Dependency bumps (stdlib 0.1.14 → 0.1.22) look reasonable
  • Script changes are minor cleanup improvements
  • Overall code quality is high with good error handling

Conclusion

This is a solid PR that meaningfully improves subscription routing robustness. The main concerns are:

  1. Potential for cascading failures in GET fallback logic
  2. Inconsistent peer location handling in router
  3. Some opportunities for code simplification

The extensive logging additions are valuable for debugging but consider tunability for production environments.

Recommendation: Approve with minor revisions suggested above.


@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

Took another pass over Claude’s feedback:

  • Documented the local wait constants and replaced the repetitive info! snapshots with register_subscriber_with_logging, which keeps the before/after diagnostics but at debug! level and avoids duplicating the logic (crates/core/src/operations/subscribe.rs).
  • Added warn-level context around fetch_contract_if_missing failures so we don’t lose the underlying error when a contract can’t be hydrated (crates/core/src/operations/subscribe.rs:803).
  • Introduced peer_distance_or_default so both the “closest peers” and “k best peers” paths treat peers without a recorded location the same way instead of filtering them out only in one branch (crates/core/src/router/mod.rs).
  • Simplified the preserved-upstream hand-off in the PUT completion path using Option::or (crates/core/src/operations/put.rs:983).

On the potential recursion: when a seek request can’t find any candidates we attempt a single GET via get::start_op(*key, true, false) (crates/core/src/operations/subscribe.rs:615). That final false means the GET op never turns around and issues a subscription, so there’s no recursive subscribe→GET→subscribe loop. If the GET fails or the contract still isn’t present after wait_for_local_contract, we immediately return return_not_subbed() (crates/core/src/operations/subscribe.rs:615-707), so we never retry indefinitely.

Testing:

  • cargo test operations::subscribe
  • cargo test --test operations subscribe_path
  • `cargo test connectivity::test_gateway_reconnection

@sanity sanity requested a review from iduartgomez November 8, 2025 17:18
.filter_map(|peer| {
peer.location
.map(|loc| (peer, target_location.distance(loc)))
.map(|peer| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a dejavuh with this, already commented in a previous PR, let's not make upa. distance, filter_map will filter on collect peers which dont have a distance setup anyway, which in any case is not gonna happen in practice, unless is some sort of transient connection

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] Thanks for flagging. I dropped the synthetic distance helper and restored the original filter_map behaviour so we only consider peers with a recorded location—matching the prior implementation and avoiding any made-up distances.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we trying to route an update while the node has not joined the network? that sounds wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iduartgomez You mean if the node is not part of the subscription tree for that contract? The idea was that the update would be routed towards the contract in the hope that it would hit a subscribed node. Is that incorrect?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During bootstrap we still pass PeerKeyLocations without a real location (e.g. from connection_manager.own_location() before the node has joined), so dropping them leaves us with no routing candidates.

If the node has no real location is because it hasn't joined yet no? There is no other way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think is semantically correct to try execute operations while technically the peer hasn't even join the network successfully. Why not return a proper error in that case and let the caller handle it properly (with retries, waits, etc.).

We don't really need to be smart here, or do we? Is better to make that behavior EXPLICIT and fail fast. IMO.

Copy link
Collaborator Author

@sanity sanity Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] @iduartgomez Makes sense. Today the stack still runs PUT/SUBSCRIBE/etc. during that short pre-join window, so removing the fallback immediately regressed tests (routing had zero candidates). Switching to the behavior you describe means adding an explicit “peer not joined” error path through those ops and teaching callers/tests to retry once the join completes. I filed #2069 to track that cleanup so we can make the semantics explicit in a follow-up. For this PR I’d prefer to keep the fallback so the current stack stays stable, but happy to reprioritize that issue if you want to tackle the fail-fast change next.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanity I leave this at your discretion, but I am afraid this will create unexpected behavior I we keep it around for long, so if we decide to go ahead I think we should prioritize fixing the root cause in a different way and get rid of this fallback and things like peer_distance_or_default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iduartgomez How about if we fix it as soon as this PR stack gets merged? I've marked the issue high-priority. I'm just worried about how complex this PR stack already is and Codex thinks this change would touch a lot of stuff.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@sanity sanity force-pushed the stack/subscription-routing branch from 117dac9 to f4ad769 Compare November 9, 2025 16:56
@iduartgomez iduartgomez force-pushed the stack/update-fallback branch from c3fdf36 to e161690 Compare November 9, 2025 18:36
@iduartgomez iduartgomez force-pushed the stack/subscription-routing branch from 4103ba5 to e1f507b Compare November 9, 2025 19:20
Base automatically changed from stack/update-fallback to main November 9, 2025 19:30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fundamentally there is a problem with conflating network subscriptions and local client application subscriptions. IMO we should distinguish them and handle them differently via different flows (e.g. sending internal event notifications for local ones)

This problem has been permeating across multiple bug fixes and changes and looks like we should address the root cause instead of keep working around it.

@iduartgomez
Copy link
Collaborator

iduartgomez commented Nov 10, 2025

Taking an overall look at the PR I think we have successfully managed to couple local subscriptions/updates to network subscriptions/updates, which is ... NOT good. The goal long-term would be to separate them absolutely, we shouldn't have to worry about local connections inside ops/ module neither viceversa at the websocket/contract layer. This is what I have been trying to avoid a long time and now we are ending with more coupling instead of less and try to untangle that next.

Let's move forward with this, I know the next PR also reinforces coupling of the transport and our logical connect network operation, although that one is harder to solve, the handshake handler was supposed to isolate peers which are logically connected from the network to those which are not, but this PR stack has nuked that. Maybe some bugs were fixed along the way idk but seems like a pyrrhic victory for long term maintainability.

Copy link
Collaborator

@iduartgomez iduartgomez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to rebase on main for merge.

@sanity sanity enabled auto-merge November 10, 2025 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants