Skip to content

Commit a5660a6

Browse files
craig[bot]dodeca12
andcommitted
Merge #156830
156830: storeliveness: smear storeliveness heartbeat messages to reduce goroutine spikes at heartbeat interval tick r=miraradeva,iskettaneh a=dodeca12 This PR introduces heartbeat smearing logic that batches and smears Store Liveness heartbeat sends across destination nodes to prevent thundering herd of goroutine spikes. ### Changes Core changes are within these files: ```sh pkg/kv/kvserver/storeliveness ├── support_manager.go # Rename SendAsync→EnqueueMessage, add smearing settings └── transport.go # Add smearing sender goroutine to transport.go which takes care of smearing when enabled ``` ### Background Previously, all stores in a cluster sent heartbeats immediately at each heartbeat interval tick. In large clusters with multi-store nodes, this created synchronized bursts of goroutine spikes that caused issues in other parts of the running CRDB process. ### Commits **Commit: Introduce heartbeat smearing** - Adds a smearing sender goroutine to `transport.go` that batches enqueued messages - Smears send signals across queues using `taskpacer` to spread traffic over time - Configurable via cluster settings (default: enabled) **How it works:** 1. Messages are enqueued via `EnqueueMessage()` into per-node queues 2. When `SendAllEnqueuedMessages()` is called, transport's smearing sender goroutine waits briefly to batch messages 3. Transport's smearing sender goroutine uses `taskpacer` to pace signaling to each queue over a smear duration 4. Each `processQueue` goroutine drains its queue and sends when signalled ### New Cluster Settings - `kv.store_liveness.heartbeat_smearing.enabled` (default: true) - Enable/disable smearing - `kv.store_liveness.heartbeat_smearing.refresh` (default: 10ms) - Batching window duration - `kv.store_liveness.heartbeat_smearing.smear` (default: 1ms) - Time to spread sends across queues ### Backward Compatibility - Feature is disabled by setting `kv.store_liveness.heartbeat_smearing.enabled=false` - When disabled, messages are sent immediately via the existing path (non-smearing mode) ### Testing - Added comprehensive unit tests verifying: - Messages batch correctly across multiple destinations - Smearing spreads signals over the configured time window - Smearing mode vs immediate mode behaviour - Message ordering and reliability All existing tests updated to call `SendAllEnqueuedMessages()` after enqueuing when smearing is enabled. #### Roachprod testing ##### Prototype #1 - Ran a prototype with a [similar design](#154942) (TL;DR of prototype, the heartbeats were smeared via `SupportManager` goroutines being put to sleep; this current design ensures that `SupportManager` goroutines do not get blocked) on a roachprod with 150 node cluster to verify smearing works. | Before changes (current behaviour on master) | After changes (prototype) | |--------|--------| | <img width="2680" height="570" alt="image" src="https://github.com/user-attachments/assets/32fe6ee0-437f-48eb-b3f1-087a3eafe6ac" /> | <img width="2692" height="634" alt="image" src="https://github.com/user-attachments/assets/66b5b82b-bbc4-4f47-a13e-5f6d42a1c6d4" /> | ##### Current changes - Ran a roachprod test with current changes but without the check for empty queues (more info - https://reviewable.io/reviews/cockroachdb/cockroach/156378#-). This check did end up proving vital, as the test results didn't show the expected smearing behaviour. - Ran a mini-roachprod test on [this prototype commit](https://github.com/cockroachdb/cockroach/pull/155317/files#diff-9282b4b1d9a2fe32fae81e5776eb081e58069b4bc7db76718873b75f026e16c1) (where the only real difference between my changes is the inclusion of the length check on the queues that have messages on that commit) showed expected smearing behaviour. <img width="1797" height="469" alt="image" src="https://github.com/user-attachments/assets/bd7778ef-9f8d-4dbf-8ed2-dac40e7fb03c" /> Fixes: #148210 Release note: None Co-authored-by: Swapneeth Gorantla <swapneeth.gorantla@cockroachlabs.com>
2 parents 45f64d0 + 824bdef commit a5660a6

File tree

6 files changed

+628
-88
lines changed

6 files changed

+628
-88
lines changed

pkg/kv/kvserver/store_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -248,7 +248,7 @@ func createTestStoreWithoutStart(
248248
supportGracePeriod := rpcContext.StoreLivenessWithdrawalGracePeriod()
249249
options := storeliveness.NewOptions(livenessInterval, heartbeatInterval, supportGracePeriod)
250250
transport, err := storeliveness.NewTransport(
251-
cfg.AmbientCtx, stopper, cfg.Clock, cfg.NodeDialer, grpcServer, drpcServer, nil, /* knobs */
251+
cfg.AmbientCtx, stopper, cfg.Clock, cfg.NodeDialer, grpcServer, drpcServer, cfg.Settings, nil, /* knobs */
252252
)
253253
require.NoError(t, err)
254254
knobs := cfg.TestingKnobs.StoreLivenessKnobs

pkg/kv/kvserver/storeliveness/BUILD.bazel

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,10 +30,12 @@ go_library(
3030
"//pkg/util/envutil",
3131
"//pkg/util/hlc",
3232
"//pkg/util/log",
33+
"//pkg/util/metamorphic",
3334
"//pkg/util/metric",
3435
"//pkg/util/protoutil",
3536
"//pkg/util/stop",
3637
"//pkg/util/syncutil",
38+
"//pkg/util/taskpacer",
3739
"//pkg/util/timeutil",
3840
"@com_github_cockroachdb_errors//:errors",
3941
"@com_github_cockroachdb_redact//:redact",

0 commit comments

Comments
 (0)