-
Notifications
You must be signed in to change notification settings - Fork 4k
Description
In large clusters (N nodes, N > 100-150), there is the following metastable failure mode that results in:
- a node seeing most other nodes as non-live
- correspondingly, reporting most ranges as unavailable
At the core of liveness gossip updates, there is the loop that processes "callbacks" sequentially (one thread per key "prefix", so a single one for "liveness" prefix). The liveness update callback ultimately calls into the nodeIsLiveCallback whenever the update moves any node from a non-live to live status. The nodeIsLiveCallback then scans all replicas on the store, in order to potentially unquiesce them. Call it the "slow path" in the callback.
We observed that in large cluster (with many ranges), the nodeIsLiveCallback "slow path" can take tens of milliseconds (say 50ms). Thus, there is a limit of 1s/50ms = 20 of such "slow paths" that we can take per second.
When there is an intermittent blip on the liveness leaseholder, it is possible that one node briefly enters the state in which it sees every other node as non-live. In order to move out of this state, it has to observe N non-live->live transitions, and call the "slow path" N times, correspondingly.
However, the non-live->live transition (which node liveness updates via gossip signify) have an expiration time some 3-6s in the future. In these few seconds, we are only able to process 20*few updates, after which all remaining updates expire and become ineffective. Good news is that we don't take the "slow path" for these remaining updates and clear the queue quickly, but bad news is that we continue seeing most nodes as non-live.
This situation repeats on the each influx of node liveness updates each 6s, and the node never catches up to see everyone as live.
Fixing that can be done in a few ways:
- Avoid taking the "slow path" when leader leases are used. It appears that this path is only needed with epoch-based leases.
- Amortize the cost of the "slow path": instead of calling it on every single callback, call it once per batch/influx of updates.
- As a variant of (2), move the "slow path" from the critical path of gossip update processing and amortize.
Jira issue: CRDB-56309