@cheatfate
Contributor

@cheatfate cheatfate commented Oct 7, 2025

This is a high-level description of the new syncing algorithm.

First of all, let's define some terms.

  1. peerStatusCheckpoint - The peer's latest finalized Checkpoint, as reported via a status request.
  2. peerStatusHead - The peer's latest head BlockId, as reported via a status request.
  3. lastSeenCheckpoint - The latest finalized Checkpoint reported by our current set of peers, i.e. the one with max(peerStatusCheckpoint.epoch).
  4. lastSeenHead - The latest head BlockId reported by our current set of peers, i.e. the one with max(peerStatusHead.slot).
  5. finalizedDistance = lastSeenCheckpoint.epoch - dag.headState.finalizedCheckpoint.epoch.
  6. wallSyncDistance = beaconClock.now().slotOrZero - dag.head.slot.
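
The definitions above can be sketched in a few lines of Python. This is only an illustration, not the code from this PR: `PeerStatus` and the function names are hypothetical, and epochs and slots are modelled as plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerStatus:
    """Hypothetical container for what a peer reports via a status request."""
    finalized_epoch: int  # peerStatusCheckpoint.epoch
    head_slot: int        # peerStatusHead.slot


def last_seen_checkpoint_epoch(peers):
    # Highest finalized epoch reported by the current set of peers.
    return max((p.finalized_epoch for p in peers), default=0)


def last_seen_head_slot(peers):
    # Highest head slot reported by the current set of peers.
    return max((p.head_slot for p in peers), default=0)


def finalized_distance(peers, local_finalized_epoch):
    # lastSeenCheckpoint.epoch - dag.headState.finalizedCheckpoint.epoch
    return last_seen_checkpoint_epoch(peers) - local_finalized_epoch


def wall_sync_distance(wall_slot, local_head_slot):
    # beaconClock.now().slotOrZero - dag.head.slot
    return wall_slot - local_head_slot
```

The sketch does not clamp the results, so `finalized_distance` can be negative when the local node is ahead of every connected peer; how the real implementation treats that case is not stated here.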

Every peer we get from PeerPool starts its own loop, which does the following:

  1. Updates the peer's status information if it is too "old", where "old" depends on the current situation:
    1.1. While forward syncing is active, status is updated every 10 * SECONDS_PER_SLOT seconds.
    1.2. When peerStatusHead.slot.epoch - peerStatusCheckpoint.epoch >= 3 (which indicates a period of non-finality), status is updated every SECONDS_PER_SLOT seconds.
    1.3. In all other cases, status is updated every 5 * SECONDS_PER_SLOT seconds.
  2. Performs by-root requests, where the roots come from the sync_dag module. If finalizedDistance() < 4 epochs, it will:
    2.1. Request blocks by root in the range [peerStatusCheckpoint.epoch.start_slot, peerStatusHead.slot].
    2.2. Request sidecars by root in the range [getForwardSidecarSlot(), peerStatusHead.slot].
  3. If finalizedDistance() > 1 epochs, it will:
    3.1. Request blocks by range in the range [dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot].
    3.2. Request sidecars by range in the range [dag.finalizedHead.slot, lastSeenCheckpoint.epoch.start_slot].
  4. If the node needs backfill and wallSyncDistance() < 1 (backfill must not affect syncing status, so we pause backfill when the node loses its synced status), it will:
    4.1. Request blocks by range in the range [dag.backfill.slot, getFrontfillSlot()].
    4.2. Request sidecars by range in the range [dag.backfill.slot, getBackfillSidecarSlot()].
  5. Pauses (to avoid endless loops) as follows:
    5.1. If the peer provided us with some information: no pause.
    5.2. If an endless loop is detected (for some unknown reason the peer did not provide any information): 1.seconds pause.
    5.3. If we have finished syncing: N seconds, up to the next slot.
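
The decisions in this loop (how often to refresh status, which steps to run, how long to pause) can be sketched as three small functions. This is a hedged illustration rather than the PR's implementation: the function names, the mainnet constants and the precedence between overlapping conditions are assumptions, since the description above does not spell them out.

```python
SECONDS_PER_SLOT = 12  # mainnet value, assumed for this sketch
SLOTS_PER_EPOCH = 32   # mainnet value, assumed for this sketch


def status_period(forward_syncing, peer_head_slot, peer_finalized_epoch):
    """Seconds after which a peer's status counts as stale (step 1).

    Assumption: rules are checked in the listed order (1.1, 1.2, 1.3).
    """
    if forward_syncing:
        return 10 * SECONDS_PER_SLOT
    if peer_head_slot // SLOTS_PER_EPOCH - peer_finalized_epoch >= 3:
        return SECONDS_PER_SLOT  # non-finality: refresh every slot
    return 5 * SECONDS_PER_SLOT


def select_steps(finalized_distance, wall_sync_distance, needs_backfill):
    """Names of the steps (2, 3, 4) enabled for one loop iteration."""
    steps = []
    if finalized_distance < 4:
        steps.append("by_root")
    if finalized_distance > 1:
        steps.append("by_range")
    if needs_backfill and wall_sync_distance < 1:
        steps.append("backfill")
    return steps


def pause_seconds(got_data, synced, seconds_to_next_slot):
    """Pause at the end of one iteration (step 5).

    Assumption: progress wins over the other two cases.
    """
    if got_data:
        return 0.0
    if synced:
        return seconds_to_next_slot
    return 1.0  # peer returned nothing for an unknown reason
```

With the thresholds as written, a finalized distance of 2 or 3 epochs enables both the by-root and the by-range step in this sketch, for example `select_steps(3, 5, True)` yields `["by_root", "by_range"]`.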

The new SyncOverseer also handles a number of EventBus events so that it can maintain the sync_dag structures.

  1. Block-from-gossip monitoring loop. This event is fired only when a block comes from gossip.
  2. Block monitoring loop. This event is fired for any block added to the processor (blocks from gossip, blocks from the proposer, blocks from sync).
  3. Finalization monitoring loop.

SyncManager and RequestManager are deprecated and removed from the codebase.
The core problem with SyncManager is that it can work with BlobSidecars but not with DataColumnSidecars. Not all columns are available immediately, so it is impossible to download blocks and columns in one step, as SyncManager did.

The same problem exists in RequestManager. Currently, when RequestManager has a missing parent, it just randomly selects 2 peers (without any filtering) and tries to download blocks and sidecars from these peers. While this works in most cases in the BlobSidecar age, in the DataColumnSidecar age the probability of success is much lower.
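
To illustrate why unfiltered peer selection fares worse with columns, here is a minimal sketch of the alternative: assigning missing column indices only to peers that claim to hold them. It is not the logic from this PR (the commit log only mentions a ColumnMap and a column-intersection step); `plan_column_requests` and the `peer_columns` map are hypothetical, and the greedy strategy is just one possible choice.

```python
def plan_column_requests(missing, peer_columns):
    """Greedily assign missing column indices to peers that hold them.

    missing:      set of column indices still needed for a block.
    peer_columns: hypothetical map of peer id -> set of column indices
                  that peer is expected to serve.
    Returns (plan, unserved): plan maps peer id -> sorted column indices
    to request; unserved lists columns no known peer can provide.
    """
    plan = {}
    remaining = set(missing)
    while remaining:
        # Pick the peer covering the most still-missing columns.
        best = max(
            peer_columns,
            key=lambda p: len(peer_columns[p] & remaining),
            default=None,
        )
        if best is None or not (peer_columns[best] & remaining):
            break  # nobody can serve what is left
        plan[best] = sorted(peer_columns[best] & remaining)
        remaining -= peer_columns[best]
    return plan, sorted(remaining)
```

For example, with `missing = {0, 1, 2, 3}` and peers `{"a": {0, 1}, "b": {1, 2, 5}, "c": {9}}`, the plan asks `a` for columns 0 and 1 and `b` for column 2, and column 3 stays unserved until a peer holding it shows up. Two peers picked at random could easily include `c`, which holds nothing useful.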

@github-actions

github-actions bot commented Oct 7, 2025

Unit Test Results

       12 files  ±  0    2 440 suites  +8   49m 2s ⏱️ + 4m 50s
12 679 tests +  7  12 114 ✔️ +  7  565 💤 ±0  0 ±0 
63 720 runs  +28  62 992 ✔️ +28  728 💤 ±0  0 ±0 

Results for commit d52549a. ± Comparison against base commit 921edf8.

♻️ This comment has been updated with latest results.

@cheatfate cheatfate marked this pull request as draft October 8, 2025 11:59
Contributor

@etan-status etan-status left a comment

Would it help to change (parts of) holesky/sepolia/hoodi over to this branch?

Back in the goerli/prater days, I found this very helpful for testing: merging to unstable (even with a subsequent revert) was sketchy, but not having it deployed anywhere was not very fruitful either.

The status-im/infra-nimbus repo controls the branch that is used, and it is automatically rebuilt daily. One can pick the branch also for a subset of nodes (in around ~25% increments), and there is also a command to resync those nodes.

My scratchpad from goerli / holesky times, with instructions on how to connect to those servers, view the logs, restart them, and monitor their metrics:

FLEET:

Hostnames: https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&var-instance=geth-03.ih-eu-mda1.nimbus.holesky&var-container=beacon-node-holesky-testing&from=now-24h&to=now&refresh=15m

look at the instance/container dropdowns
the pattern should be fairly clear
then, to SSH to them, add .status.im

Get SSH access from jakub: tell him your SSH key (the correct half), and connect using -i the_other_half to etan@unstable-large-01.aws-eu-central-1a.nimbus.prater.statusim.net

> geth-01.ih-eu-mda1.nimbus.holesky.statusim.net   (was renamed to status.im)
  geth-01.ih-eu-mda1.nimbus.holesky.status.im

https://github.com/status-im/infra-nimbus/blob/0814b659654bb77f50aac7d456767b1794145a63/ansible/group_vars/all.yml#L23
sudo systemctl --no-block start build-beacon-node-holesky-unstable && journalctl -fu build-beacon-node-holesky-unstable

restart fleet

for a in {erigon,neth,geth}-{01..10}.ih-eu-mda1.nimbus.holesky.statusim.net; do ssh -o StrictHostKeyChecking=no "$a" 'sudo systemctl --no-block start build-beacon-node-holesky-unstable'; done


tail -f /data/beacon-node-prater-unstable/logs/service.log

@jakubgs
Member

jakubgs commented Oct 14, 2025

I've opened an issue for testing of this branch:

Please comment in it when you think the branch is ready for that.

@cheatfate cheatfate force-pushed the syncv3 branch 4 times, most recently from f155233 to d852cf1 Compare October 28, 2025 08:05
@cheatfate cheatfate marked this pull request as ready for review October 28, 2025 12:21
jakubgs added a commit to status-im/infra-nimbus that referenced this pull request Oct 30, 2025
Part of testing of new syncing algorithm:
#265
status-im/nimbus-eth2#7578

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@cheatfate cheatfate force-pushed the syncv3 branch 2 times, most recently from cd5bd3c to 6f70cd6 Compare November 4, 2025 12:31
@cheatfate cheatfate force-pushed the syncv3 branch 2 times, most recently from 790e300 to b667184 Compare December 1, 2025 11:08
More changes.

Add sync access to engine events.
Remove any changes to callbacks.

Add groupSidecars(DataColumnsByRootIdentifier).

Addressing some TODOs in overseer.

Add missing shortLog()

Replace one more TODO.

Addressing column intersection TODO.

Add event handlers.

Update events implementation.

Add some debugging logs.

Fix crash.

Fix some issues and add some more debug logging.

Move all the blocks received on `range` step to the BlockBuffer.

Upgrade BlockBuffer.

Fix assertion crash.

Fix sidecars checking procedures.

Do fixes of byroot sync.

Fix compilation issues.

Add loop pause when there is no work to do.

Fix incremental math.

Make blobs and columns lists in logs smaller.
Remove some debugging log statements.

Actively reload how columns/blobs are logged.

Investigation of peer's endless loop issue.

Add performance meters.

Fix performance counter issues.

Add finalization event pruning.
Enable range syncing.

Post rebase fixes.

Add more conditions to peerPause.
Fix assertion crash.
Add overseer debug statistics.

Fix missing dag access.

Add earliest_available_slot handling.
Enable all the modes.
Add heuristic infinite-loop detection handler.

Add some debugging logs.

Fix compilation issue.

Add more debugging statements.

Move debugging statements.

Add some more information to debug logging.

Change SyncQueue[T].push method to return number of slots advanced.

Add debug information about current checkpoints stored in dag.

Restore inclusion proof verifications.

Add async control to block buffer.

Do not enter block range downloading in case when block buffer is almost full.

Add more debugging on RangeBuffer.

Use RangeBuffer shortLog.

Fix block buffer advance when empty responses being processed.

Fix sync_queue cyrillic C characters.

Remove block_buffer asynchronous handlers.

Removal of Checkpoints from SyncDag, maintain single Queues structure.

Removal of checkpoints part 2.

Fix pruning for blockBuffer and blobQuarantine.

Add more debugging statements.

Add logs for investigation lighthouse issue with range response.

Fix getSidecarSlot().
Add inpSlot to shortLog(SyncQueue).

More changes in getSidecarSlot().

Address runtime crash.

Make SyncQueue return a negative integer when a rewind happens.
Adjust SyncQueue tests.
Add earliest_available_slot logging.

Address all the warnings.

Validate early empty sidecar responses.

Add more debugging output.

More debugging output.

Add SyncQueue synchronization after rewinds.
Refactor doRangeSidecarStep.

Make block_buffer accept blocks before initSlot.

VerifierError.MissingSidecars should not affect failures count.

Disable byRoot syncing while rangeSync is active.

Post-rebase fixes.

Fix peer management in Overseer.

Update pause detector.

Fix block_buffer.peekRange() returns incorrect number of blocks.
Add test.

Fix compilation.

More fixes to block_buffer.peekRange().

Add parent_root into slimLog(blocks).

Add SyncQueue synchronization for blocks loop.

Fix sidecars step should not be active when sidecars are not needed.
Fix rewinds for blocks step.

Sidecars check should be done before request has been made.

Remove initSlot from BlockRangeBuffer.
Fix updateQueues(), eliminate dups.
Fix getBlockBlobsMap().

Fix blocks queue should not rewind sidecars queue, if its not running yet.

Add SyncPushResponse result for SyncQueue.push().
Adapt tests for it.
Fix maybeFinalized = true for sidecars step.
Replace SyncBlock -> BlockId in SyncQueue.

Add one more step in debugging MissingParent error returned by BlockProcessor.

Add more debugging statements to SyncQueue.

Make requests non-relevant more strict.

Add more debug statements to verifiers.

Fix compilation issue.

Add blob index checking to response utils.
Add some debug statements into overseer.
Make MissingSidecars error strict.

Disable blob/column quarantine pruning in sidecars step.

Add blob_quarantine logging.
Disable blob_quarantine pruning.

Disable rewind syncing for blocks step.

Add blob/column quarantine pruning for failing/empty requests.
Fix sidecar queue syncing with blocks queue process.

Store blobs/columns in quarantine right before pushing request to avoid one more leak step.

Dissect ColumnMap from blob_quarantine to its own module.

Move BlockBuffer tests to test suite.
Add invalidate() function to BlockBuffer and tests.

Add BlockBuffer invalidation.

Update backfill queues in updateQueues().

Post-rebase fixes.

MissingSidecars should not affect rewinds.

Add more debugging values to overseer.

Remove sync_dag debugging logs.

Start root sync earlier.

Fix issue with block validation response check.

Update performance counters.

Fix crash.

Remove code duplicates from performance counters.

Some fixes for roots syncing.

Add peerLog logging.

Fix sidecars syncer conditions.
Remove some debugging log statements.

Remove peer_log.

Remove some debugging log statements.

Missing sidecars helper functions.

Simplify getMissingSidecarIndices(columns).

Add missing sidecar indices to logs, so it is possible to track columns progress.

Fix test_quarantine.

Add some columns debugging statements.

Attempt to fix weird chronicles assertion.

Fix column distribution and rate logging.

Fix `You should not pop so many requests` assertion crash and start using PeerEntry's column map.

Add quarantine shortLog to check what is happening.

Add shortLog(columns).

Do not request columns if we already have it.

Fix new columns calculations.

One more fix.

Optimize getMissingSidecarIndices() and introduce getMissingColumnsMap() to blob_quarantine.
Add incl()/excl() functions to ColumnMap.
Fix peer columns detection logic in doRangeSidecarStep().

Some updates to blob_quarantine.
Refactoring doPeerUpdateRootsSidecars().

Add SyncDag path to main debug log statement.

Investigating blobs in columns age, more logs and fixes.

Still unclear where columns are lost.

Remove blob quarantine processing after finalization.

Fix: Do not remove blobs/columns on MissingSidecars/MissingParent errors.

Fix compilation issue.

Log full root map to understand why there are missing blocks.
Log when anonymous gossip messages incoming.
Log blocks and sidecars by root differently.

Fix use sidecarless quarantine as source of blocks too.
Fix missingSidecars flag calculation in Gossip event handler.

Fix BlockBuffer not properly handles MissingParent detection.

Fix not-in-range detection for sidecars queue.

Fix compilation problem.

Enable earliest_available_slot check.

Add peer_map to by root sidecars requests.

Earliest available slot is only for columns, not blocks.

Post-rebase fixes.

Add MissingSidecars cleanups.
Add SyncDag pruning on finalized epoch change.

Fix pruning errors.

Proper backfill check.

One more fix to backfill detection algorithm.

Update backfill queue limits calculation.

Fix compilation error.

One more calculation update.

Refactor sidecars queue limits calculation methods.

Add edge cases handling.

Make one edge-case non-fatal to avoid syncing being stuck.

Post-rebase fixes.

Fix block monitoring event loop.
Add maintenance loop to keep block buffers properly cleaned up.

Increase period when rootsync starts.

Fix SECONDS_PER_SLOT issue.

Tuning getStatusPeriod() function to be more precise in edge-cases.

Remove unnecessary helpers.

Do some BlockBuffer adding refactoring and fixing tests.

Update AllTests.

SLOT_DURATION.

Fix REST handlers.

Fix peer's status stale information should not appear too early.

Initialize SyncDag with genesis root/slot to avoid downloading.

Genesis is special case not an ordinary root.

One more place for genesis handling.

Another one more genesis fix.

Fix crash when backfill is not needed.

Post-rebase fixes.

Eliminate warnings.

Post-rebase fixes.

3 participants