Panic / data-race during Raft node shutdown or peer add #1755

@manchain

Summary
When a node is stopped, or when peers are added or removed in a Raft cluster, the node occasionally panics with either a nil-pointer dereference or a send on a closed channel. This looks like a race between the Raft shutdown goroutines and the consensus state workers. Similar symptoms have been reported in other issues about panics on shutdown and peer-add failures.

Steps to Reproduce

  1. Start a Raft cluster (3–6 nodes).
  2. Run transactions under load.
  3. Repeatedly add/remove peers or gracefully stop nodes.
  4. Observe that a node occasionally panics with a stack trace and crashes.

Expected Behavior

  • Node should shut down or handle peer changes gracefully without panics.
  • Errors should be surfaced, not crash the process.

Actual Behavior

  • Panic with nil-pointer deref or send on closed channel, leading to node exit.
  • Intermittent, triggered under load or frequent membership changes.

Suspected Cause

  • Race condition in the raft/ consensus shutdown sequence.
  • Goroutines still access consensus state after Close() has closed the channels or the DB.
  • Peer-add code may race with shutdown.
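
To illustrate the second point, here is a minimal, deterministic sketch of the "send on closed channel" failure mode. It is not taken from the codebase: the events channel and the trySend helper are hypothetical stand-ins, and recover is used only so the demo can report the panic instead of crashing.

```go
package main

import "fmt"

// trySend sends v on ch and converts a panic into an error.
// Hypothetical helper: it exists only to make the failure observable.
func trySend(ch chan<- int, v int) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	ch <- v
	return nil
}

func main() {
	// Hypothetical stand-in for a channel owned by the consensus layer.
	events := make(chan int, 1)

	// What a Close() might do during shutdown.
	close(events)

	// A worker that is still running and tries to publish after Close().
	// Sending on a closed channel panics even if the buffer has room.
	fmt.Println(trySend(events, 1))
}
```

In a real node the close and the send happen on different goroutines, so the panic only appears when the scheduler interleaves them unluckily, which would match the intermittent behavior described above.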

Suggested Fix

  • Add a shutdown flag plus locking to guard lifecycle transitions.
  • Replace direct channel closes with context cancellation.
  • Run under -race to identify the exact data races.
  • Enforce the shutdown order: stop net handlers → stop consensus workers → close DB/channels.
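
The suggestions above could be combined roughly as follows. This is a sketch under assumptions, not the project's actual code: Node, NewNode, AddPeer, Stop, and ErrStopped are hypothetical names, and the peers map stands in for the consensus state.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ErrStopped is a hypothetical error returned once shutdown has begun.
var ErrStopped = errors.New("node is stopped")

// Node is a hypothetical stand-in for a Raft node.
type Node struct {
	mu      sync.Mutex // guards stopped and peers
	stopped bool
	peers   map[string]struct{}

	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
	events chan string // never closed; workers exit via ctx instead
}

func NewNode() *Node {
	ctx, cancel := context.WithCancel(context.Background())
	n := &Node{
		peers:  make(map[string]struct{}),
		ctx:    ctx,
		cancel: cancel,
		events: make(chan string),
	}
	n.wg.Add(1)
	go n.worker()
	return n
}

// worker applies peer additions until the context is cancelled.
func (n *Node) worker() {
	defer n.wg.Done()
	for {
		select {
		case <-n.ctx.Done():
			return
		case id := <-n.events:
			n.mu.Lock()
			n.peers[id] = struct{}{}
			n.mu.Unlock()
		}
	}
}

// AddPeer surfaces an error instead of panicking if shutdown has begun.
func (n *Node) AddPeer(id string) error {
	n.mu.Lock()
	stopped := n.stopped
	n.mu.Unlock()
	if stopped {
		return ErrStopped
	}
	select {
	case n.events <- id:
		return nil
	case <-n.ctx.Done(): // shutdown raced with this call
		return ErrStopped
	}
}

// Stop is safe to call more than once.
func (n *Node) Stop() {
	n.mu.Lock()
	if n.stopped {
		n.mu.Unlock()
		return
	}
	n.stopped = true // step 1: reject new requests
	n.mu.Unlock()

	n.cancel()  // step 2: signal workers to exit
	n.wg.Wait() // ...and wait until they have
	// Step 3: only now would it be safe to close the DB or other resources.
}

func main() {
	n := NewNode()
	fmt.Println(n.AddPeer("peer-1"))
	n.Stop()
	fmt.Println(n.AddPeer("peer-2"))
}
```

Because the events channel is never closed, a late sender cannot hit "send on closed channel"; it observes the cancelled context and gets ErrStopped back. Running a sketch like this with go run -race (or the real tests with go test -race) is one way to check that the lifecycle transitions are free of data races.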

Impact

  • Production Raft clusters risk instability and crashes during routine maintenance.
  • A crashed validator can cause loss of quorum.
