This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 344fce1

Merge pull request #335 from grafana/playbooks-for-config-alerts
Fixed and improved runtime config alerts and playbooks
2 parents 8817fc8 + 9cb3da5 commit 344fce1

3 files changed, with 32 additions and 18 deletions.

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -14,9 +14,12 @@
 * [CHANGE] Ingester/Ruler: set `-server.grpc-max-send-msg-size-bytes` and `-server.grpc-max-send-msg-size-bytes` to sensible default values (10MB). #326
 * [CHANGE] Renamed `CortexCompactorHasNotUploadedBlocksSinceStart` to `CortexCompactorHasNotUploadedBlocks`. #334
 * [CHANGE] Renamed `CortexCompactorRunFailed` to `CortexCompactorHasNotSuccessfullyRunCompaction`. #334
+* [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
+* [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
+* [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

 ## 1.9.0 / 2021-05-18

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 6 additions & 15 deletions
@@ -92,39 +92,30 @@
   },
 },
 {
-  alert: 'CortexInconsistentConfig',
+  alert: 'CortexInconsistentRuntimeConfig',
   expr: |||
-    count(count by(%s, job, sha256) (cortex_config_hash)) without(sha256) > 1
+    count(count by(%s, job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1
   ||| % $._config.alert_aggregation_labels,
   'for': '1h',
   labels: {
-    severity: 'warning',
+    severity: 'critical',
   },
   annotations: {
     message: |||
-      An inconsistent config file hash is used across cluster {{ $labels.job }}.
+      An inconsistent runtime config file is used across cluster {{ $labels.job }}.
     |||,
   },
 },
 {
-  // As of https://github.com/cortexproject/cortex/pull/2092, this metric is
-  // only exposed when it is supposed to be non-zero, so we don't need to do
-  // any special filtering on the job label.
-  // The metric itself was renamed in
-  // https://github.com/cortexproject/cortex/pull/2874
-  //
-  // TODO: Remove deprecated metric name of
-  // cortex_overrides_last_reload_successful in the future
   alert: 'CortexBadRuntimeConfig',
   expr: |||
+    # The metric value is reset to 0 on error while reloading the config at runtime.
     cortex_runtime_config_last_reload_successful == 0
-    or
-    cortex_overrides_last_reload_successful == 0
   |||,
   // Alert quicker for human errors.
   'for': '5m',
   labels: {
-    severity: 'warning',
+    severity: 'critical',
   },
   annotations: {
     message: |||

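The renamed alert's expression is a two-level aggregation: the inner `count by(...)` collapses all replicas that report the same `sha256`, and the outer `count ... without(sha256)` counts how many distinct hashes remain per job. The following is a minimal Python sketch of that logic, not part of the commit. The function name, the sample label sets, and the `("cluster", "namespace")` default for the aggregation labels are assumptions made for illustration, and the `for: 1h` condition is not modelled.

```python
from collections import defaultdict


def jobs_with_inconsistent_runtime_config(series, aggregation_labels=("cluster", "namespace")):
    """Return {group: distinct_hash_count} for groups reporting more than one hash.

    `series` is a list of label dictionaries, one per cortex_runtime_config_hash
    sample. The aggregation labels are an assumed stand-in for
    $._config.alert_aggregation_labels.
    """
    hashes_per_group = defaultdict(set)
    for labels in series:
        # Inner aggregation: replicas sharing a hash collapse into one set entry.
        group = tuple(labels[name] for name in (*aggregation_labels, "job"))
        hashes_per_group[group].add(labels["sha256"])
    # Outer aggregation plus the "> 1" filter: keep groups with several hashes.
    return {
        group: len(hashes)
        for group, hashes in hashes_per_group.items()
        if len(hashes) > 1
    }


if __name__ == "__main__":
    # Hypothetical samples: the ingesters disagree, the distributors agree.
    samples = [
        {"cluster": "c1", "namespace": "cortex", "job": "cortex/ingester", "pod": "ingester-0", "sha256": "aaa"},
        {"cluster": "c1", "namespace": "cortex", "job": "cortex/ingester", "pod": "ingester-1", "sha256": "bbb"},
        {"cluster": "c1", "namespace": "cortex", "job": "cortex/distributor", "pod": "distributor-0", "sha256": "aaa"},
        {"cluster": "c1", "namespace": "cortex", "job": "cortex/distributor", "pod": "distributor-1", "sha256": "aaa"},
    ]
    print(jobs_with_inconsistent_runtime_config(samples))
```

Only the ingester group is returned for the sample data, because it is the only job whose replicas report two different hashes.
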
cortex-mixin/docs/playbooks.md

Lines changed: 23 additions & 3 deletions
@@ -367,13 +367,33 @@ _TODO: this playbook has not been written yet._

 _TODO: this playbook has not been written yet._

-### CortexInconsistentConfig
+### CortexInconsistentRuntimeConfig

-_TODO: this playbook has not been written yet._
+This alert fires if multiple replicas of the same Cortex service are using different runtime configs for an extended period of time.
+
+The Cortex runtime config is a config file which gets live reloaded by Cortex at runtime. In order for Cortex to work properly, the loaded config is expected to be exactly the same across multiple replicas of the same Cortex service (e.g. distributors, ingesters, ...). When the config changes, there may be short periods of time during which some replicas have loaded the new config and others are still running on the previous one, but it shouldn't last for more than a few minutes.
+
+How to **investigate**:
+- Check how many different config file versions (hashes) are reported
+```
+count by (sha256) (cortex_runtime_config_hash{namespace="<namespace>"})
+```
+- Check which replicas are running a different version
+```
+cortex_runtime_config_hash{namespace="<namespace>",sha256="<unexpected>"}
+```
+- Check if the runtime config has been updated on the affected replicas' filesystem. Check the `-runtime-config.file` command line argument to find the location of the file.
+- Check the affected replicas' logs and look for any error loading the runtime config

 ### CortexBadRuntimeConfig

-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is unable to reload the runtime config.
+
+This typically means an invalid runtime config was deployed. Cortex keeps running with the previous (valid) version of the runtime config; running Cortex replicas and the system availability shouldn't be affected, but new replicas won't be able to start up until the runtime config is fixed.
+
+How to **investigate**:
+- Check the latest runtime config update (it's likely to be broken)
+- Check Cortex logs to get more details about what's wrong with the config

 ### CortexQuerierCapacityFull
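
The first two investigation steps of the `CortexInconsistentRuntimeConfig` playbook (count replicas per hash, then list the replicas on an unexpected hash) can be sketched in Python as follows. This is an illustration only: the function name and the `pod` replica label are hypothetical, and treating the most common hash as the expected one is an assumption that an operator should confirm against the config that was actually deployed.

```python
from collections import Counter


def replicas_on_unexpected_hash(series, replica_label="pod"):
    """Return (replica count per hash, sorted replicas not on the majority hash).

    `series` is a list of label dictionaries, one per cortex_runtime_config_hash
    sample. Assumption: the hash reported by most replicas is the expected one.
    """
    counts = Counter(labels["sha256"] for labels in series)
    if not counts:
        return {}, []
    expected_hash, _ = counts.most_common(1)[0]
    outliers = sorted(
        labels[replica_label] for labels in series if labels["sha256"] != expected_hash
    )
    return dict(counts), outliers


if __name__ == "__main__":
    # Hypothetical samples: one ingester is still on an older config.
    samples = [
        {"pod": "ingester-0", "sha256": "aaa"},
        {"pod": "ingester-1", "sha256": "aaa"},
        {"pod": "ingester-2", "sha256": "bbb"},
    ]
    per_hash, outliers = replicas_on_unexpected_hash(samples)
    print(per_hash, outliers)
```

The replicas returned as outliers are the ones whose filesystem and logs the remaining playbook steps ask you to inspect.
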

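The behaviour described in the `CortexBadRuntimeConfig` playbook (a failed reload keeps the previous valid config and resets the metric to 0, while a fresh replica cannot start from an invalid file) can be modelled with a toy reloader. This is a sketch under stated assumptions, not Cortex code: the `RuntimeConfigReloader` class is hypothetical, and JSON parsing stands in for the real runtime config format to keep the example dependency-free.

```python
import json


class RuntimeConfigReloader:
    """Toy model of a live-reloaded config (hypothetical, not the Cortex implementation)."""

    def __init__(self, initial_text):
        # An invalid initial config raises here, which models a new replica
        # failing to start until the config is fixed.
        self.config = json.loads(initial_text)
        # Models cortex_runtime_config_last_reload_successful.
        self.last_reload_successful = 1

    def reload(self, text):
        """Try to load a new config; keep the previous one if parsing fails."""
        try:
            new_config = json.loads(text)
        except ValueError:
            # The previous valid config stays active; only the metric changes.
            self.last_reload_successful = 0
            return False
        self.config = new_config
        self.last_reload_successful = 1
        return True


if __name__ == "__main__":
    reloader = RuntimeConfigReloader('{"ingestion_rate": 100}')
    reloader.reload('{"ingestion_rate": ')  # broken update
    print(reloader.config, reloader.last_reload_successful)
```

A later valid update sets the modelled metric back to 1, which is why fixing the deployed file is enough for the alert condition to clear.
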