This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 2137f76

Merge pull request #346 from grafana/playbook-for-CortexCacheRequestErrors
Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors
2 parents c92bec2 + d3ec6ed commit 2137f76

3 files changed, with 35 additions and 9 deletions.

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -17,11 +17,13 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboard show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [CHANGE] Removed `CortexQuerierCapacityFull` alert. #342
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 7 additions & 7 deletions
@@ -165,20 +165,20 @@
     },
   },
   {
-    alert: 'CortexCacheRequestErrors',
+    alert: 'CortexMemcachedRequestErrors',
     expr: |||
-      100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-        /
-      sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-        > 1
+      (
+        sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+        sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+      ) * 100 > 5
     ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-    'for': '15m',
+    'for': '5m',
     labels: {
       severity: 'warning',
     },
     annotations: {
       message: |||
-        Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+        Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
       |||,
     },
   },
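
The new expression divides the per-cache, per-operation failure rate by the total operation rate and alerts when the result exceeds 5%. A minimal Python sketch of that arithmetic and of the `%s` substitution follows. The label value `cluster, namespace` is an assumption for illustration (the real value comes from `$._config.alert_aggregation_labels`), the function names are hypothetical, and the `for: 5m` hold period is not modeled.

```python
# Assumed value for illustration only; the mixin reads the real one from
# $._config.alert_aggregation_labels.
ALERT_AGGREGATION_LABELS = "cluster, namespace"

# Same shape as the expression in the diff, with two %s placeholders.
EXPR_TEMPLATE = """(
  sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
  sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
) * 100 > 5
"""


def render_expr(labels=ALERT_AGGREGATION_LABELS):
    """Fill both placeholders with the same aggregation labels."""
    return EXPR_TEMPLATE % (labels, labels)


def error_percentage(failure_rate, operation_rate):
    """Return the failure percentage, or None when no operations were seen."""
    if operation_rate == 0:
        return None
    return 100.0 * failure_rate / operation_rate


def breaches_threshold(failure_rate, operation_rate, threshold_pct=5.0):
    """True when the failure percentage is strictly above the threshold."""
    pct = error_percentage(failure_rate, operation_rate)
    return pct is not None and pct > threshold_pct


if __name__ == "__main__":
    print(render_expr())
    print(breaches_threshold(failure_rate=6.0, operation_rate=100.0))
```

Because the comparison is strict, a series sitting at exactly 5% does not satisfy the condition.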

cortex-mixin/docs/playbooks.md

Lines changed: 26 additions & 2 deletions
@@ -435,9 +435,33 @@ How to **investigate**:
 - On multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, you may consider to enable it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
 - On multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider to temporarily increase the shard size for affected tenants: be aware that this could affect other tenants too, reducing resources available to run other tenant queries. Alternatively, you may choose to do nothing and let Cortex return errors for that given user once the per-tenant queue is full.

-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors

-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issues
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details

 ### CortexOldChunkInMemory
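
The investigation steps in the new playbook can be sketched as a small triage helper. This is an illustration, not part of the commit: the function names, the example namespace and the Prometheus base URL are hypothetical, and falling back to the `other` steps for an unlisted reason is an assumption. The query text and the per-reason actions are taken from the playbook; `/api/v1/query` is the Prometheus instant-query endpoint.

```python
from urllib.parse import urlencode

# Query from the playbook, with the namespace left as a placeholder.
QUERY_TEMPLATE = (
    "sum by(name, operation, reason) "
    '(rate(thanos_memcached_operation_failures_total{{namespace="{namespace}"}}[1m])) > 0'
)

# Per-reason actions, copied from the playbook.
NEXT_STEPS = {
    "timeout": ["Scale up the memcached replicas"],
    "server-error": ["Check both Cortex and memcached logs to find more details"],
    "network-error": ["Check Cortex logs to find more details"],
    "malformed-key": [
        "The key is too long or contains invalid characters",
        "Check Cortex logs to find the offending key",
        "Fixing this will require changes to the application code",
    ],
    "other": ["Check both Cortex and memcached logs to find more details"],
}


def build_query(namespace):
    """Return the playbook query for the given Cortex namespace."""
    return QUERY_TEMPLATE.format(namespace=namespace)


def build_query_url(base_url, namespace):
    """Return an instant-query URL for a Prometheus server (hypothetical base_url)."""
    params = urlencode({"query": build_query(namespace)})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def next_steps(reason):
    """Look up the playbook steps; unlisted reasons fall back to 'other' (assumption)."""
    return NEXT_STEPS.get(reason, NEXT_STEPS["other"])


if __name__ == "__main__":
    print(build_query("cortex-prod"))
    for step in next_steps("malformed-key"):
        print("-", step)
```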
