This repository was archived by the owner on Apr 28, 2025. It is now read-only.
CHANGELOG.md
2 lines changed: 2 additions & 0 deletions
@@ -17,11 +17,13 @@
 * [CHANGE] Renamed `CortexInconsistentConfig` alert to `CortexInconsistentRuntimeConfig` and increased severity to `critical`. #335
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
cortex-mixin/docs/playbooks.md
28 lines changed: 26 additions & 2 deletions
@@ -435,9 +435,33 @@ How to **investigate**:
 - On multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, you may consider enabling it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
 - On multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider temporarily increasing the shard size for affected tenants: be aware that this could affect other tenants too, reducing resources available to run other tenant queries. Alternatively, you may choose to do nothing and let Cortex return errors for that given user once the per-tenant queue is full.

-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors

-_TODO: this playbook has not been written yet._
+This alert fires if the Cortex memcached client is experiencing a high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issues
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details
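
The triage steps in the playbook above can be sketched as a small helper. This is only an illustration, not part of this change or of Cortex: `build_query`, `actions_for`, and the two constants are hypothetical names, and falling back to the `other` steps for an unrecognised `reason` is an assumption. The query string and the remediation steps follow the playbook.

```python
from typing import List

# Query from the playbook; the doubled braces are str.format escapes.
QUERY_TEMPLATE = (
    "sum by(name, operation, reason) "
    '(rate(thanos_memcached_operation_failures_total{{namespace="{namespace}"}}[1m])) > 0'
)

# Remediation steps from the playbook, keyed by the `reason` label value.
ACTIONS = {
    "timeout": ["Scale up the memcached replicas"],
    "server-error": ["Check both Cortex and memcached logs to find more details"],
    "network-error": ["Check Cortex logs to find more details"],
    "malformed-key": [
        "The key is too long or contains invalid characters",
        "Check Cortex logs to find the offending key",
        "Fixing this will require changes to the application code",
    ],
    "other": ["Check both Cortex and memcached logs to find more details"],
}


def build_query(namespace: str) -> str:
    """Return the playbook query with <namespace> filled in."""
    return QUERY_TEMPLATE.format(namespace=namespace)


def actions_for(reason: str) -> List[str]:
    """Return the playbook steps for a failure reason.

    Assumption: reasons not listed in the playbook are treated like `other`.
    """
    return ACTIONS.get(reason, ACTIONS["other"])


if __name__ == "__main__":
    print(build_query("cortex"))
    for step in actions_for("malformed-key"):
        print("-", step)
```

Running the script prints the query for a namespace named `cortex` (a placeholder) followed by the three `malformed-key` steps. Keeping the steps in a plain dictionary makes it easy to keep the helper in sync with the playbook text.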