This repository was archived by the owner on Apr 28, 2025. It is now read-only.
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -18,6 +18,7 @@
 * [CHANGE] Increased `CortexBadRuntimeConfig` alert severity to `critical` and removed support for `cortex_overrides_last_reload_successful` metric (was removed in Cortex 1.3.0). #335
 * [CHANGE] Grafana 'min step' changed to 15s so dashboards show better detail. #340
 * [CHANGE] Replace `CortexRulerFailedEvaluations` with two new alerts: `CortexRulerTooManyFailedPushes` and `CortexRulerTooManyFailedQueries`. #347
 * [ENHANCEMENT] cortex-mixin: Make `cluster_namespace_deployment:kube_pod_container_resource_requests_{cpu_cores,memory_bytes}:sum` backwards compatible with `kube-state-metrics` v2.0.0. #317
 * [ENHANCEMENT] Added documentation text panels and descriptions to reads and writes dashboards. #324
 * [ENHANCEMENT] Dashboards: defined container functions for common resources panels: containerDiskWritesPanel, containerDiskReadsPanel, containerDiskSpaceUtilization. #331
--- a/cortex-mixin/docs/playbooks.md
+++ b/cortex-mixin/docs/playbooks.md
@@ -419,17 +419,36 @@ How to **investigate**:
 - Check the latest runtime config update (it's likely to be broken)
 - Check Cortex logs to get more details about what's wrong with the config
 
-### CortexQuerierCapacityFull
-
-_TODO: this playbook has not been written yet._
-
 ### CortexFrontendQueriesStuck
 
-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is running without query-scheduler and queries are piling up in the query-frontend queue.
+
+The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see the other playbook for more details.
 
 ### CortexSchedulerQueriesStuck
 
-_TODO: this playbook has not been written yet._
+This alert fires if queries are piling up in the query-scheduler.
+
+How it **works**:
+- A query-frontend API endpoint is called to execute a query
+- The query-frontend enqueues the request to the query-scheduler
+- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
+- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query
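
The four steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not how Cortex implements the query-scheduler: `ToyScheduler` and its methods are hypothetical names, there is a single FIFO queue instead of per-tenant queues, and no networking is involved.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for the query-scheduler: one FIFO queue plus a worker pool."""

    def __init__(self, num_workers):
        self.queue = deque()             # queries waiting for an idle querier worker
        self.running = deque()           # queries currently executing on a worker
        self.idle_workers = num_workers
        self.responses = []              # stands in for responses sent to the query-frontend

    def enqueue(self, query):
        # The query-frontend enqueues the request to the scheduler.
        self.queue.append(query)
        self._dispatch()

    def _dispatch(self):
        # The scheduler hands enqueued queries to idle workers, oldest first.
        while self.queue and self.idle_workers > 0:
            self.idle_workers -= 1
            self.running.append(self.queue.popleft())

    def finish_one(self):
        # A querier finishes the oldest running query, returns the response and
        # tells the scheduler it can process another query.
        query = self.running.popleft()
        self.responses.append(f"result of {query}")
        self.idle_workers += 1
        self._dispatch()


scheduler = ToyScheduler(num_workers=2)
for q in ["q1", "q2", "q3", "q4"]:
    scheduler.enqueue(q)

queue_before = len(scheduler.queue)  # both workers are busy, so two queries wait
scheduler.finish_one()
queue_after = len(scheduler.queue)   # one worker freed up, so one query was dispatched
print(queue_before, queue_after)
```

In this model the queue only grows while every worker is busy, which is the condition the alert is meant to surface.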
+  - `panic`: look for the stack trace in the logs and investigate from there
+- Is QPS increased?
+  - Scale up queriers to satisfy the increased workload
+- Is query latency increased?
+  - An increased latency reduces the number of queries we can run per second: once all workers are busy, new queries will pile up in the queue
+  - Temporarily scale up queriers to try to stop the bleed
+  - Check if a specific tenant is running heavy queries
+    - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"}) > 0` to find tenants with enqueued queries
+    - Check the `Cortex / Slow Queries` dashboard to find slow queries
+  - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, consider enabling it for that specific tenant to reduce its blast radius. To enable querier shuffle-sharding for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant (the value should be the number of queriers assigned to the tenant).
+  - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, consider temporarily increasing the shard size for affected tenants; be aware that this could affect other tenants too, reducing the resources available to run their queries. Alternatively, you may choose to do nothing and let Cortex return errors for that user once the per-tenant queue is full.
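
The point that higher latency lowers the number of queries served per second can be checked with back-of-the-envelope arithmetic. The sketch below assumes each querier worker runs one query at a time and that latency is roughly constant; the function names and all numbers are hypothetical and do not come from Cortex.

```python
def max_queries_per_second(num_workers, avg_latency_seconds):
    # Each worker completes 1 / latency queries per second when fully busy.
    return num_workers / avg_latency_seconds


def queue_growth_per_second(arrival_qps, num_workers, avg_latency_seconds):
    # Queries arriving faster than they can be served accumulate in the queue.
    capacity = max_queries_per_second(num_workers, avg_latency_seconds)
    return max(0.0, arrival_qps - capacity)


# Hypothetical cluster: 40 querier workers receiving 50 queries per second.
baseline_capacity = max_queries_per_second(40, 0.5)       # latency of 0.5s
degraded_capacity = max_queries_per_second(40, 2.0)       # latency rises to 2s
growth_when_degraded = queue_growth_per_second(50, 40, 2.0)
growth_after_scaling = queue_growth_per_second(50, 100, 2.0)

print(baseline_capacity, degraded_capacity, growth_when_degraded, growth_after_scaling)
```

With these made-up numbers, a fourfold latency increase cuts capacity from 80 to 20 queries per second, so the queue grows by 30 queries every second until queriers are scaled up or latency recovers.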