This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit f83ae25

Merge pull request #338 from grafana/playbook-for-request-errors
Add playbook for CortexRequestErrors and config option to exclude specific routes
2 parents b32d042 + 967ab57 commit f83ae25

4 files changed: 32 additions, 10 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
+* [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 16 additions & 5 deletions
@@ -21,11 +21,14 @@
       // Note if alert_aggregation_labels is "job", this will repeat the label. But
       // prometheus seems to tolerate that.
       expr: |||
-        100 * sum by (%s, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",route!~"ready"}[1m]))
+        100 * sum by (%(group_by)s, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",route!~"%(excluded_routes)s"}[1m]))
           /
-        sum by (%s, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready"}[1m]))
+        sum by (%(group_by)s, job, route) (rate(cortex_request_duration_seconds_count{route!~"%(excluded_routes)s"}[1m]))
           > 1
-      ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
+      ||| % {
+        group_by: $._config.alert_aggregation_labels,
+        excluded_routes: std.join('|', ['ready'] + $._config.alert_excluded_routes),
+      },
       'for': '15m',
       labels: {
         severity: 'critical',
@@ -39,10 +42,18 @@
     {
       alert: 'CortexRequestLatency',
       expr: |||
-        %(group_prefix_jobs)s_route:cortex_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|ready|/schedulerpb.SchedulerForFrontend/FrontendLoop|/schedulerpb.SchedulerForQuerier/QuerierLoop"}
+        %(group_prefix_jobs)s_route:cortex_request_duration_seconds:99quantile{route!~"%(excluded_routes)s"}
           >
         %(cortex_p99_latency_threshold_seconds)s
-      ||| % $._config,
+      ||| % $._config {
+        excluded_routes: std.join('|', [
+          'metrics',
+          '/frontend.Frontend/Process',
+          'ready',
+          '/schedulerpb.SchedulerForFrontend/FrontendLoop',
+          '/schedulerpb.SchedulerForQuerier/QuerierLoop',
+        ] + $._config.alert_excluded_routes),
+      },
       'for': '15m',
       labels: {
         severity: 'warning',
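
The formatting change above can be checked in isolation. The following is a minimal sketch, not part of the commit: `config` is a hypothetical stand-in for `$._config`, and `debug_pprof` is an example route name chosen only for illustration. It reproduces the `std.join` plus named-placeholder pattern used by the new `CortexRequestErrors` expression.

```jsonnet
// Hypothetical stand-in for $._config; only the field used below is defined.
local config = {
  // Example value for illustration; the commit's default is an empty list.
  alert_excluded_routes: ['debug_pprof'],
};

{
  // Same pattern as in the diff: 'ready' is always excluded, and any
  // user-supplied routes are appended and joined with '|'.
  request_errors_matcher: 'route!~"%(excluded_routes)s"' % {
    excluded_routes: std.join('|', ['ready'] + config.alert_excluded_routes),
  },
}
```

Evaluating this yields the matcher `route!~"ready|debug_pprof"`. With the default empty list, the result is `route!~"ready"`, which matches the expression before this change. Because the joined string is used with the `!~` operator, each entry is interpreted as a regular expression rather than a literal route name.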

cortex-mixin/config.libsonnet

Lines changed: 3 additions & 0 deletions
@@ -64,5 +64,8 @@
       writes: true,
       reads: true,
     },
+
+    // The routes to exclude from alerts.
+    alert_excluded_routes: [],
   },
 }
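
A consumer of the mixin would presumably set the new option by overriding `_config`. The sketch below assumes that the mixin is vendored under `cortex-mixin/` with a `mixin.libsonnet` entry point and that `_config` is declared as a hidden field (`::`); neither detail is shown in this diff, and the route name is hypothetical.

```jsonnet
// Assumed import path; adjust to wherever cortex-mixin is vendored.
local mixin = import 'cortex-mixin/mixin.libsonnet';

mixin {
  _config+:: {
    // Hypothetical route name, for illustration only. Entries are appended
    // to the built-in exclusions of CortexRequestErrors and CortexRequestLatency.
    alert_excluded_routes: ['debug_pprof'],
  },
}
```

Keeping the default as an empty list means existing deployments render the same alert expressions as before unless they opt in.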

cortex-mixin/docs/playbooks.md

Lines changed: 12 additions & 5 deletions
@@ -109,7 +109,18 @@ Right now most of the execution time will be spent in PromQL's innerEval. NB tha
 
 ### CortexRequestErrors
 
-_TODO: this playbook has not been written yet._
+This alert fires when the rate of 5xx errors of a specific route is > 1% for some time.
+
+This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one.
+
+How to **investigate**:
+- Check for which route the alert fired
+  - Write path: open the `Cortex / Writes` dashboard
+  - Read path: open the `Cortex / Reads` dashboard
+- Looking at the dashboard you should see in which Cortex service the error originates
+  - The panels in the dashboard are vertically sorted by the network path (eg. on the write path: cortex-gw -> distributor -> ingester)
+- If the failing service is going OOM (`OOMKilled`): scale up or increase the memory
+- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
 
 ### CortexTransferFailed
 This alert goes off when an ingester fails to find another node to transfer its data to when it was shutting down. If there is both a pod stuck terminating and one stuck joining, look at the kubernetes events. This may be due to scheduling problems caused by some combination of anti affinity rules/resource utilization. Adding a new node can help in these circumstances. You can see recent events associated with a resource via kubectl describe, ex: `kubectl -n <namespace> describe pod <pod>`
@@ -355,10 +366,6 @@ WAL corruptions are only detected at startups, so at this point the WAL/Checkpoi
 2. Equal or more than the quorum number but less than replication factor: There is a good chance that there is no data loss if it was replicated to desired number of ingesters. But it's good to check once for data loss.
 3. Equal or more than the replication factor: Then there is definitely some data loss.
 
-### CortexRequestErrors
-
-_TODO: this playbook has not been written yet._
-
 ### CortexTableSyncFailure
 
 _This alert applies to Cortex chunks storage only._
