
Commit bb5a40a (parent: 90a7809)

Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck

Signed-off-by: Marco Pracucci <marco@pracucci.com>

1 file changed: cortex-mixin/docs/playbooks.md (22 additions, 2 deletions)
@@ -408,11 +408,31 @@ _TODO: this playbook has not been written yet._
### CortexFrontendQueriesStuck

This alert fires if Cortex is running without the query-scheduler and queries are piling up in the query-frontend queue.

The procedure to investigate it is the same as the one for [`CortexSchedulerQueriesStuck`](#CortexSchedulerQueriesStuck): please see that playbook for more details.

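As a quick check, you can graph the queue length exported by the query-frontend. This is a sketch assuming the default metric name `cortex_query_frontend_queue_length` and its per-tenant `user` label:

```
# Queries currently waiting in the query-frontend queue, broken down by tenant.
sum by (user) (cortex_query_frontend_queue_length)
```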

### CortexSchedulerQueriesStuck

This alert fires if queries are piling up in the query-scheduler.

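To confirm that queries are actually piling up, you can graph the scheduler queue length. A sketch, assuming the default metric name `cortex_query_scheduler_queue_length`:

```
# Total number of queries currently waiting in the query-scheduler queue.
sum(cortex_query_scheduler_queue_length)
```
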
How it **works**:
- A query-frontend API endpoint is called to execute a query
- The query-frontend enqueues the request to the query-scheduler
- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler

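A quick sanity check on this flow is to verify that querier workers are actually connected to the query-scheduler: a growing queue with few or no connected workers points at crash-looping or disconnected queriers rather than pure load. A sketch, assuming the scheduler exposes `cortex_query_scheduler_connected_querier_clients` (verify the metric name on your Cortex version):

```
# Querier worker connections registered with each query-scheduler instance.
sum by (instance) (cortex_query_scheduler_connected_querier_clients)
```
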
How to **investigate**:
- Are queriers in a crash loop (e.g. `OOMKilled`)?
  - `OOMKilled`: temporarily increase the queriers' memory request/limit
  - `panic`: look for the stack trace in the logs and investigate from there
- Has QPS increased?
  - Scale up queriers to satisfy the increased workload
- Has query latency increased?
  - Increased latency reduces the number of queries we can run per second: once all querier workers are busy, new queries pile up in the queue
  - Temporarily scale up queriers to try to stop the bleeding
  - Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries
  - If it's a multi-tenant Cortex cluster and shuffle-sharding is disabled for queriers, you may consider enabling it only for that specific tenant to reduce its blast radius. To enable querier shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for that tenant, with the value set to the number of queriers assigned to the tenant (see the example below).

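For example, a minimal runtime-config overrides entry might look like the following sketch. The tenant ID `tenant-a` and the value `10` are illustrative only; set the value to the number of queriers you want assigned to that tenant:

```
overrides:
  tenant-a:                      # hypothetical tenant ID
    max_queriers_per_tenant: 10  # number of queriers assigned to this tenant
```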

### CortexCacheRequestErrors