gossip: node liveness updates don't keep up

In large clusters (N nodes, N > 100-150), there is the following metastable failure mode that results in:

- a node seeing most other nodes as non-live
- correspondingly, reporting most ranges as unavailable

At the core of liveness gossip updates, there is the [loop](https://github.com/cockroachdb/cockroach/blob/83e74b77eb60ae497c6697cb6fd59e5160ce1b39/pkg/gossip/infostore.go#L274-L290) that processes "callbacks" sequentially (one thread per key "prefix", so a single one for "liveness" prefix). The liveness update callback ultimately [calls](https://github.com/cockroachdb/cockroach/blob/83e74b77eb60ae497c6697cb6fd59e5160ce1b39/pkg/kv/kvserver/store.go#L2391-L2395) into the `nodeIsLiveCallback` whenever the update [moves](https://github.com/cockroachdb/cockroach/blob/83e74b77eb60ae497c6697cb6fd59e5160ce1b39/pkg/kv/kvserver/liveness/liveness.go#L534-L538) any node from a non-live to live status. The `nodeIsLiveCallback` then [scans](https://github.com/cockroachdb/cockroach/blob/83e74b77eb60ae497c6697cb6fd59e5160ce1b39/pkg/kv/kvserver/store_raft.go#L823-L835) all replicas on the store, in order to potentially unquiesce them. Call it the "slow path" in the callback.

We observed that in large cluster (with many ranges), the `nodeIsLiveCallback` "slow path" can take tens of milliseconds (say 50ms). Thus, there is a limit of 1s/50ms = 20 of such "slow paths" that we can take per second.

When there is an intermittent blip on the liveness leaseholder, it is possible that one node briefly enters the state in which it sees every other node as non-live. In order to move out of this state, it has to observe N `non-live`->`live` transitions, and call the "slow path" N times, correspondingly.

However, the `non-live`->`live` transition (which node liveness updates via gossip signify) have an expiration time some 3-6s in the future. In these `few` seconds, we are only able to process `20*few` updates, after which all remaining updates expire and become ineffective. Good news is that we don't take the "slow path" for these remaining updates and clear the queue quickly, but bad news is that we continue seeing most nodes as non-live.

This situation repeats on the each influx of node liveness updates each 6s, and the node never catches up to see everyone as live.

---

Fixing that can be done in a few ways:

1. Avoid taking the "slow path" when leader leases are used. It appears that this path is only needed with epoch-based leases.
2. Amortize the cost of the "slow path": instead of calling it on every single callback, call it once per batch/influx of updates.
3. As a variant of (2), move the "slow path" from the critical path of gossip update processing and amortize.

Jira issue: CRDB-56309

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

gossip: node liveness updates don't keep up #157089

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

gossip: node liveness updates don't keep up #157089

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions