Summary
When stopping a node or adding/removing peers in a Raft cluster, the node occasionally panics (nil pointer dereference or send on closed channel). This looks like a race between the Raft shutdown goroutines and the consensus state workers. Similar symptoms have been reported under "panic on shutdown" and "peer add failures".
Steps to Reproduce
- Start a Raft cluster (3–6 nodes).
- Run transactions under load.
- Repeatedly add/remove peers or gracefully stop nodes.
- Occasionally, a node panics with a stack trace and crashes.
Expected Behavior
- Node should shut down or handle peer changes gracefully without panics.
- Errors should be surfaced, not crash the process.
Actual Behavior
- Panic with nil-pointer deref or send on closed channel, leading to node exit.
- Intermittent, triggered under load or frequent membership changes.
Suspected Cause
- Race condition in the `raft`/consensus shutdown sequence.
- Goroutines still accessing consensus state after `Close()` closes channels or DB.
- Peer-add code may race with shutdown.
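To illustrate the "send on closed channel" half of the symptom, here is a minimal standalone sketch. It is not taken from the codebase: `trySend` is a hypothetical helper, and the `close(ch)` call only stands in for whatever `Close()` does to its channels. The panic is recovered so that the message can be printed instead of crashing the process.

```go
package main

import "fmt"

// trySend sends v on ch and reports whether the send panicked.
// Hypothetical helper for illustration only; it stands in for a
// consensus worker that is still running after shutdown started.
func trySend(ch chan int, v int) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	ch <- v
	return false, ""
}

func main() {
	ch := make(chan int, 1)

	// Stands in for Close() tearing down channels while workers may
	// still hold a reference to them.
	close(ch)

	// A late send on the closed channel panics, even though the
	// channel is buffered and has free capacity.
	panicked, msg := trySend(ch, 1)
	fmt.Println(panicked, msg)
}
```

In the real node there is no `recover` around the send, so the same condition would take the whole process down, which matches the reported behavior.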
Suggested Fix
- Add a `shutdown` flag + locking to guard lifecycle transitions.
- Replace direct channel closes with context cancellation.
- Run under `-race` to identify exact data races.
- Ensure shutdown order: stop net handlers → stop consensus workers → close DB/channels.
Impact
- Production Raft clusters risk instability and crash during routine maintenance.
- Can cause loss of quorum if a validator crashes.