This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 53d67f5

Merge branch 'main' into ruler-alerts
2 parents 6fc5f2f + 2137f76 commit 53d67f5

3 files changed, +40 -12 lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -18,11 +18,13 @@
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboard show better detail. #340
 * [CHANGE] Replace `CortexRulerFailedEvaluations` with two new alerts: `CortexRulerTooManyFailedPushes` and `CortexRulerTooManyFailedQueries`. #347
+* [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
 * [CHANGE] Removed `CortexQuerierCapacityFull` alert. #342
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
 * [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
+* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
 * [BUGFIX] Fixed `CortexIngesterHasNotShippedBlocks` alert false positive in case an ingester instance had ingested samples in the past, then no traffic was received for a long period and then it started receiving samples again. #308
 * [BUGFIX] Alertmanager: fixed `--alertmanager.cluster.peers` CLI flag passed to alertmanager when HA is enabled. #329
 * [BUGFIX] Fixed `CortexInconsistentRuntimeConfig` metric. #335
@@ -80,7 +82,7 @@
 - Cortex / Queries: added bucket index load operations and latency (available only when bucket index is enabled)
 - Alerts: added "CortexBucketIndexNotUpdated" (bucket index only) and "CortexTenantHasPartialBlocks"
 * [ENHANCEMENT] The name of the overrides configmap is now customisable via `$._config.overrides_configmap`. #244
-* [ENHANCEMENT] Added flag to control usage of bucket-index, and enable it by default when using blocks. #254
+* [ENHANCEMENT] Added flag to control usage of bucket-index and disable it by default when using blocks. #254
 * [ENHANCEMENT] Added the alert `CortexIngesterHasUnshippedBlocks`. #255
 * [BUGFIX] Honor configured `per_instance_label` in all panels. #239
 * [BUGFIX] `CortexRequestLatency` alert now ignores long-running requests on query-scheduler. #242

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 7 additions & 7 deletions
@@ -165,20 +165,20 @@
       },
     },
     {
-      alert: 'CortexCacheRequestErrors',
+      alert: 'CortexMemcachedRequestErrors',
       expr: |||
-        100 * sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count{status_code=~"5.."}[1m]))
-          /
-        sum by (%s, method) (rate(cortex_cache_request_duration_seconds_count[1m]))
-          > 1
+        (
+          sum by(%s, name, operation) (rate(thanos_memcached_operation_failures_total[1m])) /
+          sum by(%s, name, operation) (rate(thanos_memcached_operations_total[1m]))
+        ) * 100 > 5
       ||| % [$._config.alert_aggregation_labels, $._config.alert_aggregation_labels],
-      'for': '15m',
+      'for': '5m',
       labels: {
         severity: 'warning',
       },
       annotations: {
         message: |||
-          Cache {{ $labels.method }} is experiencing {{ printf "%.2f" $value }}% errors.
+          Memcached {{ $labels.name }} used by Cortex in {{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
         |||,
       },
     },
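
Reviewer note: the arithmetic behind the new `CortexMemcachedRequestErrors` expression can be sketched in plain Python. This is only an illustration, not part of the commit. It assumes the aggregation labels expand to `namespace` alone (the real value comes from `$._config.alert_aggregation_labels`), the function names and sample rates are hypothetical, and the `'for': '5m'` hold period is not modelled.

```python
def memcached_error_percent(failures, operations):
    """Return the error percentage per label group.

    Both arguments map a (namespace, name, operation) tuple to a per-second
    rate, standing in for the two `sum by(...) (rate(...[1m]))` operands.
    """
    percent = {}
    for key, total in operations.items():
        # PromQL only divides series whose labels match on both sides.
        # Zero denominators are skipped to avoid ZeroDivisionError; PromQL
        # would yield NaN or +Inf there, an edge case this sketch ignores.
        if key in failures and total > 0:
            percent[key] = 100.0 * failures[key] / total
    return percent


def firing_groups(failures, operations, threshold=5.0):
    """Keep only the groups whose error percentage exceeds the threshold."""
    return {
        key: value
        for key, value in memcached_error_percent(failures, operations).items()
        if value > threshold
    }


# Illustrative sample rates (hypothetical values, not taken from a real cluster).
failures = {
    ("cortex", "index-cache", "getmulti"): 0.6,
    ("cortex", "chunks-cache", "set"): 0.1,
}
operations = {
    ("cortex", "index-cache", "getmulti"): 10.0,
    ("cortex", "chunks-cache", "set"): 10.0,
}
print(firing_groups(failures, operations))
```

With these sample rates only the `index-cache` group crosses the 5% threshold (about 6% errors), while `chunks-cache` stays at roughly 1%.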

cortex-mixin/docs/playbooks.md

Lines changed: 30 additions & 4 deletions
@@ -50,10 +50,12 @@ How the limit is **configured**:
 - The configured limit can be queried via `cortex_ingester_instance_limits{limit="max_series"}`

 How to **fix**:
+1. **Temporarily increase the limit**<br />
+   If the actual number of series is very close or already hit the limit, or if you foresee the ingester will hit the limit before dropping the stale series as effect of the scale up, you should also temporarily increase the limit.
+1. **Check if shuffle-sharding shard size is correct**<br />
+   When shuffle-sharding is enabled, we target to 100K series / tenant / ingester. You can run `avg by (user) (cortex_ingester_memory_series_created_total{namespace="<namespace>"} - cortex_ingester_memory_series_removed_total{namespace="<namespace>"}) > 100000` to find out tenants with > 100K series / ingester. You may want to increase the shard size for these tenants.
 1. **Scale up ingesters**<br />
    Scaling up ingesters will lower the number of series per ingester. However, the effect of this change will take up to 4h, because after the scale up we need to wait until all stale series are dropped from memory as the effect of TSDB head compaction, which could take up to 4h (with the default config, TSDB keeps in-memory series up to 3h old and it gets compacted every 2h).
-2. **Temporarily increase the limit**<br />
-   If the actual number of series is very close or already hit the limit, or if you foresee the ingester will hit the limit before dropping the stale series as effect of the scale up, you should also temporarily increase the limit.

 ### CortexIngesterReachingTenantsLimit
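
Reviewer note: the shuffle-sharding check added in the hunk above (`avg by (user) (created - removed) > 100000`) can be read as the following Python sketch. It is an illustration under stated assumptions, not code from the repository: the function name, the `(user, ingester)` key shape, and the sample counter values are hypothetical.

```python
from collections import defaultdict


def tenants_over_target(created, removed, target=100_000):
    """Return tenants whose average in-memory series per ingester exceeds target.

    `created` and `removed` map a (user, ingester) tuple to the value of
    cortex_ingester_memory_series_created_total and
    cortex_ingester_memory_series_removed_total respectively.
    """
    per_user = defaultdict(list)
    for key, created_total in created.items():
        # The PromQL subtraction only matches series present on both sides.
        if key in removed:
            user = key[0]
            per_user[user].append(created_total - removed[key])

    averages = {user: sum(vals) / len(vals) for user, vals in per_user.items()}
    return {user: avg for user, avg in averages.items() if avg > target}


# Hypothetical counters for two tenants.
created = {("tenant-a", "ingester-1"): 250_000, ("tenant-a", "ingester-2"): 150_000,
           ("tenant-b", "ingester-1"): 90_000}
removed = {("tenant-a", "ingester-1"): 50_000, ("tenant-a", "ingester-2"): 30_000,
           ("tenant-b", "ingester-1"): 10_000}
print(tenants_over_target(created, removed))
```

Here `tenant-a` averages 160K series per ingester and would be a candidate for a larger shard size, while `tenant-b` stays under the 100K target.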
@@ -450,9 +452,33 @@ How to **investigate**:
 - On multi-tenant Cortex cluster with **shuffle-sharing for queriers disabled**, you may consider to enable it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
 - On multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider to temporarily increase the shard size for affected tenants: be aware that this could affect other tenants too, reducing resources available to run other tenant queries. Alternatively, you may choose to do nothing and let Cortex return errors for that given user once the per-tenant queue is full.

-### CortexCacheRequestErrors
+### CortexMemcachedRequestErrors

-_TODO: this playbook has not been written yet._
+This alert fires if Cortex memcached client is experiencing an high error rate for a specific cache and operation.
+
+How to **investigate**:
+- The alert reports which cache is experiencing issue
+  - `metadata-cache`: object store metadata cache
+  - `index-cache`: TSDB index cache
+  - `chunks-cache`: TSDB chunks cache
+- Check which specific error is occurring
+  - Run the following query to find out the reason (replace `<namespace>` with the actual Cortex cluster namespace)
+    ```
+    sum by(name, operation, reason) (rate(thanos_memcached_operation_failures_total{namespace="<namespace>"}[1m])) > 0
+    ```
+- Based on the **`reason`**:
+  - `timeout`
+    - Scale up the memcached replicas
+  - `server-error`
+    - Check both Cortex and memcached logs to find more details
+  - `network-error`
+    - Check Cortex logs to find more details
+  - `malformed-key`
+    - The key is too long or contains invalid characters
+    - Check Cortex logs to find the offending key
+    - Fixing this will require changes to the application code
+  - `other`
+    - Check both Cortex and memcached logs to find more details


 ### CortexOldChunkInMemory
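
Reviewer note: the `reason` triage in the new `CortexMemcachedRequestErrors` playbook amounts to a lookup table. The sketch below restates it in Python for illustration only. The names `NEXT_STEPS` and `next_steps` are hypothetical, and falling back to the `other` entry for an unlisted reason is an assumption of this sketch, not something the playbook states.

```python
# Follow-up actions per `reason` label value, copied from the playbook text.
NEXT_STEPS = {
    "timeout": ["Scale up the memcached replicas"],
    "server-error": ["Check both Cortex and memcached logs to find more details"],
    "network-error": ["Check Cortex logs to find more details"],
    "malformed-key": [
        "The key is too long or contains invalid characters",
        "Check Cortex logs to find the offending key",
        "Fixing this will require changes to the application code",
    ],
    "other": ["Check both Cortex and memcached logs to find more details"],
}


def next_steps(reason):
    """Return the playbook actions for a failure reason.

    Unknown reasons fall back to the `other` entry (an assumption of this sketch).
    """
    return NEXT_STEPS.get(reason, NEXT_STEPS["other"])


for step in next_steps("malformed-key"):
    print("-", step)
```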