This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit fe7bd13

Merge pull request #347 from grafana/ruler-alerts
Replace ruler alerts, and add playbooks.
2 parents 2137f76 + 53d67f5 commit fe7bd13

3 files changed: 45 additions and 7 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
+* [CHANGE] Replace `CortexRulerFailedEvaluations` with two new alerts: `CortexRulerTooManyFailedPushes` and `CortexRulerTooManyFailedQueries`. #347
 * [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [CHANGE] Removed `CortexQuerierCapacityFull` alert. #342
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 25 additions & 5 deletions
@@ -512,20 +512,40 @@
       name: 'ruler_alerts',
       rules: [
         {
-          alert: 'CortexRulerFailedEvaluations',
+          alert: 'CortexRulerTooManyFailedPushes',
           expr: |||
-            sum by (%s, instance, rule_group) (rate(cortex_prometheus_rule_evaluation_failures_total[1m]))
+            100 * (
+            sum by (%s, instance) (rate(cortex_ruler_write_requests_failed_total[1m]))
               /
-            sum by (%s, instance, rule_group) (rate(cortex_prometheus_rule_evaluations_total[1m]))
-              > 0.01
+            sum by (%s, instance) (rate(cortex_ruler_write_requests_total[1m]))
+            ) > 1
+          ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+          'for': '5m',
+          labels: {
+            severity: 'critical',
+          },
+          annotations: {
+            message: |||
+              Cortex Ruler {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% write (push) errors.
+            |||,
+          },
+        },
+        {
+          alert: 'CortexRulerTooManyFailedQueries',
+          expr: |||
+            100 * (
+            sum by (%s, instance) (rate(cortex_ruler_queries_failed_total[1m]))
+              /
+            sum by (%s, instance) (rate(cortex_ruler_queries_total[1m]))
+            ) > 1
           ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
           'for': '5m',
           labels: {
             severity: 'warning',
           },
           annotations: {
             message: |||
-              Cortex Ruler {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% errors for the rule group {{ $labels.rule_group }}.
+              Cortex Ruler {{ $labels.instance }} is experiencing {{ printf "%.2f" $value }}% errors while evaluating rules.
             |||,
           },
         },
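Both new expressions contain two `%s` placeholders that are filled from a two-element array, so each `sum by (...)` clause receives the same aggregation labels. Jsonnet's `%` operator follows Python-style string formatting, so the substitution can be sketched in Python. The label value `cluster, namespace` below is a hypothetical stand-in for `$._config.alert_aggregation_labels`, and `render_expr` is an illustrative helper, not part of the mixin.

```python
# Hypothetical value; in the mixin it comes from $._config.alert_aggregation_labels.
ALERT_AGGREGATION_LABELS = "cluster, namespace"

# Expression template copied from the CortexRulerTooManyFailedPushes rule.
EXPR_TEMPLATE = """\
100 * (
sum by (%s, instance) (rate(cortex_ruler_write_requests_failed_total[1m]))
  /
sum by (%s, instance) (rate(cortex_ruler_write_requests_total[1m]))
) > 1
"""


def render_expr(template: str, labels: str) -> str:
    """Fill both %s placeholders with the same label list.

    One argument is needed per placeholder, which mirrors the
    two-element array used in the Jsonnet source.
    """
    return template % (labels, labels)


rendered = render_expr(EXPR_TEMPLATE, ALERT_AGGREGATION_LABELS)
print(rendered)
```

Under that assumed label value, both aggregation clauses read `sum by (cluster, namespace, instance)`, so the numerator and denominator are grouped identically and the division matches series one-to-one.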

cortex-mixin/docs/playbooks.md

Lines changed: 19 additions & 2 deletions
@@ -146,9 +146,26 @@ More information:
 
 This alert occurs when a ruler is unable to validate whether or not it should claim ownership over the evaluation of a rule group. The most likely cause is that one of the rule ring entries is unhealthy. If this is the case, proceed to the ring admin HTTP page and forget the unhealthy ruler. The other possible cause would be an error returned by the ring client. If this is the case, look into debugging the ring based on the in-use backend implementation.
 
-### CortexRulerFailedEvaluations
+### CortexRulerTooManyFailedPushes
 
-_TODO: this playbook has not been written yet._
+This alert fires when rulers cannot push new samples (the result of rule evaluation) to ingesters.
+
+In general, pushing samples can fail due to problems with Cortex operations (e.g. too many ingesters have crashed, and the ruler cannot write samples to them), or due to problems with the resulting data (e.g. a user hitting the limit on the number of series, out-of-order samples, etc.).
+This alert fires only for the first kind of problem, and not for problems caused by limits or invalid rules.
+
+How to **fix**:
+- Investigate the ruler logs to find out why the ruler cannot write samples. Note that the ruler logs all push errors, including "user errors", but those do not cause the alert to fire. Focus on problems with ingesters.
+
+### CortexRulerTooManyFailedQueries
+
+This alert fires when rulers fail to evaluate rule queries.
+
+A rule evaluation may fail for many reasons, e.g. an invalid PromQL expression, or a query hitting the limit on the number of chunks. These are "user errors", and this alert ignores them.
+
+There is a category of errors that is more important: errors due to failure to read data from store-gateways or ingesters. These errors would result in a 500 when run from the querier. This alert fires if there are too many such failures.
+
+How to **fix**:
+- Investigate the ruler logs to find out why the ruler cannot evaluate queries. Note that the ruler logs rule evaluation errors even for "user errors", but those do not cause the alert to fire. Focus on problems with ingesters or store-gateways.
 
 ### CortexRulerMissedEvaluations
 
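For readers triaging either alert, the condition in the rule expressions (`100 * (failed / total) > 1`) amounts to a per-instance failure percentage above 1%, which must then hold for the rule's `'for': '5m'` window before the alert fires. A minimal Python sketch of that arithmetic follows. The function names and sample rates are hypothetical, and it assumes the `*_failed_total` counters already exclude "user errors", as the playbook text describes.

```python
import math


def failure_percentage(failed_rate: float, total_rate: float) -> float:
    """Return 100 * failed / total, the value compared against the threshold.

    Division by zero follows IEEE 754 floating-point semantics:
    0/0 is NaN and x/0 (x > 0) is +Inf.
    """
    if total_rate == 0:
        return math.nan if failed_rate == 0 else math.inf
    return 100.0 * failed_rate / total_rate


def condition_met(failed_rate: float, total_rate: float,
                  threshold_percent: float = 1.0) -> bool:
    """True when the failure percentage exceeds the threshold.

    NaN compares false, so an idle ruler (no requests at all)
    never satisfies the condition.
    """
    return failure_percentage(failed_rate, total_rate) > threshold_percent


# Hypothetical per-second rates for a single ruler instance.
print(condition_met(0.5, 20.0))  # 2.5% failures: above the 1% threshold
print(condition_met(0.1, 20.0))  # 0.5% failures: below the threshold
print(condition_met(0.0, 0.0))   # no traffic: condition not met
```

The sketch covers only the instantaneous comparison; the five-minute `for` duration is applied by the rule evaluator and is not modelled here.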
