
Commit d3ec6ed

Merge branch 'main' into playbook-for-CortexCacheRequestErrors
2 parents: 6b9cedc + c92bec2

File tree: 3 files changed (+31, -24 lines)

CHANGELOG.md

Lines changed: 2 additions & 1 deletion

@@ -18,6 +18,7 @@
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
 * [CHANGE] Removed `CortexCacheRequestErrors` alert. This alert was not working because the legacy Cortex cache client instrumentation doesn't track errors. #346
+* [CHANGE] Removed `CortexQuerierCapacityFull` alert. #342
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
@@ -80,7 +81,7 @@
 - Cortex / Queries: added bucket index load operations and latency (available only when bucket index is enabled)
 - Alerts: added "CortexBucketIndexNotUpdated" (bucket index only) and "CortexTenantHasPartialBlocks"
 * [ENHANCEMENT] The name of the overrides configmap is now customisable via `$._config.overrides_configmap`. #244
-* [ENHANCEMENT] Added flag to control usage of bucket-index, and enable it by default when using blocks. #254
+* [ENHANCEMENT] Added flag to control usage of bucket-index and disable it by default when using blocks. #254
 * [ENHANCEMENT] Added the alert `CortexIngesterHasUnshippedBlocks`. #255
 * [BUGFIX] Honor configured `per_instance_label` in all panels. #239
 * [BUGFIX] `CortexRequestLatency` alert now ignores long-running requests on query-scheduler. #242

cortex-mixin/alerts/alerts.libsonnet

Lines changed: 0 additions & 15 deletions

@@ -134,21 +134,6 @@
         |||,
       },
     },
-    {
-      alert: 'CortexQuerierCapacityFull',
-      expr: |||
-        prometheus_engine_queries_concurrent_max{job=~".+/(cortex|ruler|querier)"} - prometheus_engine_queries{job=~".+/(cortex|ruler|querier)"} == 0
-      |||,
-      'for': '5m',  // We don't want to block for longer.
-      labels: {
-        severity: 'critical',
-      },
-      annotations: {
-        message: |||
-          {{ $labels.job }} is at capacity processing queries.
-        |||,
-      },
-    },
     {
       alert: 'CortexFrontendQueriesStuck',
       expr: |||

cortex-mixin/docs/playbooks.md

Lines changed: 29 additions & 8 deletions

@@ -50,10 +50,12 @@ How the limit is **configured**:
 - The configured limit can be queried via `cortex_ingester_instance_limits{limit="max_series"}`
 
 How to **fix**:
+1. **Temporarily increase the limit**<br />
+   If the actual number of series is very close to or has already hit the limit, or if you foresee the ingester will hit the limit before the stale series are dropped as an effect of the scale up, you should also temporarily increase the limit.
+1. **Check if the shuffle-sharding shard size is correct**<br />
+   When shuffle-sharding is enabled, we target 100K series / tenant / ingester. You can run `avg by (user) (cortex_ingester_memory_series_created_total{namespace="<namespace>"} - cortex_ingester_memory_series_removed_total{namespace="<namespace>"}) > 100000` to find tenants with > 100K series / ingester. You may want to increase the shard size for these tenants.
 1. **Scale up ingesters**<br />
    Scaling up ingesters will lower the number of series per ingester. However, the effect of this change will take up to 4h, because after the scale up we need to wait until all stale series are dropped from memory as the effect of TSDB head compaction, which could take up to 4h (with the default config, TSDB keeps in-memory series up to 3h old and it gets compacted every 2h).
-2. **Temporarily increase the limit**<br />
-   If the actual number of series is very close or already hit the limit, or if you foresee the ingester will hit the limit before dropping the stale series as effect of the scale up, you should also temporarily increase the limit.
 
 ### CortexIngesterReachingTenantsLimit

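For the series-limit checks in the playbook above, a PromQL sketch along the following lines can show how close each ingester is to its `max_series` instance limit. This is illustrative only: the `pod` grouping label and the `<namespace>` placeholder are assumptions that depend on the deployment's `per_instance_label` and scrape configuration; the metrics themselves are the ones referenced in the playbook.

```
# Sketch: per-ingester utilization of the max_series instance limit.
# Values approaching 1 mean the ingester is about to hit the limit.
(
    sum by (pod) (cortex_ingester_memory_series_created_total{namespace="<namespace>"})
  - sum by (pod) (cortex_ingester_memory_series_removed_total{namespace="<namespace>"})
)
/ on (pod)
  sum by (pod) (cortex_ingester_instance_limits{namespace="<namespace>", limit="max_series"})
```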
@@ -402,17 +404,36 @@ How to **investigate**:
 - Check the latest runtime config update (it's likely to be broken)
 - Check Cortex logs to get more details about what's wrong with the config
 
-### CortexQuerierCapacityFull
-
-_TODO: this playbook has not been written yet._
-
 ### CortexFrontendQueriesStuck
 
-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is running without the query-scheduler and queries are piling up in the query-frontend queue.
+
+The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see that playbook for more details.
 
 ### CortexSchedulerQueriesStuck
 
-_TODO: this playbook has not been written yet._
+This alert fires if queries are piling up in the query-scheduler.
+
+How it **works**:
+- A query-frontend API endpoint is called to execute a query
+- The query-frontend enqueues the request to the query-scheduler
+- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
+- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query
+
+How to **investigate**:
+- Are queriers in a crash loop (e.g. OOMKilled)?
+  - `OOMKilled`: temporarily increase the queriers' memory request/limit
+  - `panic`: look for the stack trace in the logs and investigate from there
+- Has QPS increased?
+  - Scale up queriers to satisfy the increased workload
+- Has query latency increased?
+  - Increased latency reduces the number of queries we can run per second: once all workers are busy, new queries pile up in the queue
+  - Temporarily scale up queriers to try to stop the bleeding
+- Check if a specific tenant is running heavy queries
+  - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"}) > 0` to find tenants with enqueued queries
+  - Check the `Cortex / Slow Queries` dashboard to find slow queries
+  - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, you may consider enabling it for that specific tenant to reduce its blast radius. To enable querier shuffle-sharding for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant (the value should be set to the number of queriers assigned to the tenant).
+  - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider temporarily increasing the shard size for the affected tenants: be aware that this could affect other tenants too, reducing the resources available to run their queries. Alternatively, you may choose to do nothing and let Cortex return errors for that user once its per-tenant queue is full.
 
 ### CortexMemcachedRequestErrors

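To complement the `CortexSchedulerQueriesStuck` playbook added above, sketches like the ones below can help confirm whether queries are queueing because all querier workers are busy. They are illustrative only: they reuse the queue-length metric from the playbook and the engine metrics from the `CortexQuerierCapacityFull` alert removed in this commit, and the `<namespace>` matcher is an assumption about your label set.

```
# Sketch: queue depth per tenant in the query-scheduler (same metric as the playbook).
sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"})

# Sketch: fraction of concurrent query slots in use across queriers, built from the
# same metrics the removed CortexQuerierCapacityFull alert used. Values close to 1
# mean the queriers are saturated and new queries will pile up in the queue.
  sum(prometheus_engine_queries{job=~".+/(cortex|ruler|querier)"})
/
  sum(prometheus_engine_queries_concurrent_max{job=~".+/(cortex|ruler|querier)"})
```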