This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 6b9cedc

Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors
Signed-off-by: Marco Pracucci <marco@pracucci.com>
1 parent 3528572 commit 6b9cedc

3 files changed

+35
-9
lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -17,10 +17,12 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboard show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 7 additions & 7 deletions
@@ -180,20 +180,20 @@
     },
   },
   {
-    alert: 'CortexCacheRequestErrors',
+    alert: 'CortexMemcachedRequestErrors',
     expr: |||
-      100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-        /
-      sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-        > 1
+      (
+        sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+        sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+      ) * 100 > 5
     ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-    'for': '15m',
+    'for': '5m',
     labels: {
       severity: 'warning',
     },
     annotations: {
       message: |||
-        Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+        Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
       |||,
     },
   },
},

cortex-mixin/docs/playbooks.md

Lines changed: 26 additions & 2 deletions
@@ -414,9 +414,33 @@ _TODO: this playbook has not been written yet._
 
 _TODO: this playbook has not been written yet._
 
-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors
 
-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issues
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+  - Based on the **`reason`**:
+    - `timeout`
+      - Scale up the memcached replicas
+    - `server-error`
+      - Check both Cortex and memcached logs to find more details
+    - `network-error`
+      - Check Cortex logs to find more details
+    - `malformed-key`
+      - The key is too long or contains invalid characters
+      - Check Cortex logs to find the offending key
+      - Fixing this will require changes to the application code
+    - `other`
+      - Check both Cortex and memcached logs to find more details
 
 ### CortexOldChunkInMemory
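
The reason-to-action mapping in the `CortexMemcachedRequestErrors` playbook above can also be read as a simple lookup table. The Python sketch below encodes it that way. The names `PLAYBOOK_STEPS` and `next_steps` are hypothetical, and falling back to the `other` steps for an unrecognized reason is an assumption of this sketch, not something the playbook states.

```python
# Steps copied from the playbook, keyed by the `reason` label value.
PLAYBOOK_STEPS = {
    "timeout": [
        "Scale up the memcached replicas",
    ],
    "server-error": [
        "Check both Cortex and memcached logs to find more details",
    ],
    "network-error": [
        "Check Cortex logs to find more details",
    ],
    "malformed-key": [
        "The key is too long or contains invalid characters",
        "Check Cortex logs to find the offending key",
        "Fixing this will require changes to the application code",
    ],
    "other": [
        "Check both Cortex and memcached logs to find more details",
    ],
}


def next_steps(reason):
    """Return the playbook steps for a failure reason.

    Assumption: unknown reasons are treated like `other`.
    """
    return PLAYBOOK_STEPS.get(reason, PLAYBOOK_STEPS["other"])


for step in next_steps("malformed-key"):
    print(f"- {step}")
```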
