This repository was archived by the owner on Apr 28, 2025. It is now read-only.
cortex-mixin/docs/playbooks.md (22 additions & 2 deletions)
@@ -408,11 +408,31 @@ _TODO: this playbook has not been written yet._
### CortexFrontendQueriesStuck

-_TODO: this playbook has not been written yet._
+This alert fires if Cortex is running without query-scheduler and queries are piling up in the query-frontend queue.
+
+The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see the other playbook for more details.

### CortexSchedulerQueriesStuck

-_TODO: this playbook has not been written yet._
+This alert fires if queries are piling up in the query-scheduler.
+
+How it **works**:
+- A query-frontend API endpoint is called to execute a query
+- The query-frontend enqueues the request to the query-scheduler
+- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
+- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler
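
The flow described in the "How it works" list can be illustrated with a toy model. This is only a sketch to show why queries get stuck when there are fewer idle workers than queued queries; `ToyScheduler`, `simulate`, and the `query-N` / `querier-N` names are hypothetical and do not correspond to Cortex code or APIs.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical in-memory stand-in for the query-scheduler queue."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, query):
        # The query-frontend enqueues the request to the scheduler.
        self.queue.append(query)

    def dispatch(self, idle_workers):
        # The scheduler hands queued queries to idle querier workers,
        # oldest query first, until it runs out of queries or workers.
        assigned = []
        while self.queue and idle_workers:
            assigned.append((idle_workers.pop(0), self.queue.popleft()))
        return assigned


def simulate(num_queries, num_workers):
    """Return (queries dispatched, queries still waiting in the queue)."""
    scheduler = ToyScheduler()
    for i in range(num_queries):
        scheduler.enqueue(f"query-{i}")
    workers = [f"querier-{w}" for w in range(num_workers)]
    assigned = scheduler.dispatch(workers)
    return len(assigned), len(scheduler.queue)


in_flight, stuck = simulate(num_queries=5, num_workers=3)
print(in_flight, stuck)  # 3 2
```

In this model the two leftover queries stay queued until a worker finishes and becomes idle again, which is the condition the alert is meant to surface.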
+  - `panic`: look for the stack trace in the logs and investigate from there
+- Is QPS increased?
+  - Scale up queriers to satisfy the increased workload
+- Is query latency increased?
+  - An increased latency reduces the number of queries we can run per second: once all workers are busy, new queries will pile up in the queue
+  - Temporarily scale up queriers to try to stop the bleed
+  - Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries
+    - If it's a multi-tenant Cortex cluster and shuffle-sharding is disabled for queriers, you may consider enabling it only for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant, you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
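
The relationship between query latency and queue growth mentioned above can be estimated with simple arithmetic. The sketch below assumes each querier worker runs one query at a time and that latency is roughly uniform; the function names and the numbers in the example are hypothetical, not values taken from Cortex.

```python
def max_queries_per_second(workers, avg_latency_seconds):
    # With every worker busy, each one completes 1 / latency queries per second.
    return workers / avg_latency_seconds


def queue_growth_per_second(incoming_qps, workers, avg_latency_seconds):
    # Queries arriving faster than the workers can complete them pile up.
    capacity = max_queries_per_second(workers, avg_latency_seconds)
    return max(0.0, incoming_qps - capacity)


# Assumed example: 40 workers and 50 incoming queries per second.
print(max_queries_per_second(40, 0.5))       # 80.0
print(queue_growth_per_second(50, 40, 0.5))  # 0.0
print(max_queries_per_second(40, 2.0))       # 20.0
print(queue_growth_per_second(50, 40, 2.0))  # 30.0
```

Under these assumptions, a jump in average latency from 0.5s to 2s cuts capacity from 80 to 20 queries per second, so the same incoming load starts queueing about 30 queries every second. Scaling up queriers raises `workers` and is why it can temporarily stop the bleed.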