
Commit 34b4d25

Merge pull request #341 from grafana/add-playbook-for-stuck-queries-alerts
Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck
2 parents 3528572 + e9dfa71 commit 34b4d25

File tree

1 file changed (+25 -2 lines)


cortex-mixin/docs/playbooks.md

Lines changed: 25 additions & 2 deletions

### CortexFrontendQueriesStuck

This alert fires if Cortex is running without the query-scheduler and queries are piling up in the query-frontend queue.

The procedure to investigate it is the same as for [`CortexSchedulerQueriesStuck`](#cortexschedulerqueriesstuck): see that playbook for more details.
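
As a first check, you can look at the per-tenant queue length on the query-frontend itself. The query below is a sketch: it assumes your Cortex version exposes the `cortex_query_frontend_queue_length` metric with a `user` label (replace `<namespace>` with the affected namespace).

```
# Tenants with queries waiting in the query-frontend queue (assumed metric name and labels)
sum by (user) (cortex_query_frontend_queue_length{namespace="<namespace>"}) > 0
```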

### CortexSchedulerQueriesStuck

This alert fires if queries are piling up in the query-scheduler.

How it **works**:
- A query-frontend API endpoint is called to execute a query
- The query-frontend enqueues the request to the query-scheduler
- The query-scheduler is responsible for dispatching enqueued queries to idle querier workers
- The querier runs the query, sends the response back directly to the query-frontend and notifies the query-scheduler that it can process another query
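
To see where queries are getting stuck in this flow, the queue metrics exposed by the query-scheduler are a good starting point. The queries below are a sketch: `cortex_query_scheduler_queue_length` is the metric also used in the investigation steps below, while the `cortex_query_scheduler_queue_duration_seconds` histogram is assumed to be exposed by your Cortex version (replace `<namespace>` with the affected namespace).

```
# Number of queries currently sitting in the query-scheduler queue
sum(cortex_query_scheduler_queue_length{namespace="<namespace>"})

# Approximate 99th percentile time a query waits in the queue before a querier
# picks it up (assumed histogram name)
histogram_quantile(0.99, sum by (le) (rate(cortex_query_scheduler_queue_duration_seconds_bucket{namespace="<namespace>"}[5m])))
```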
How to **investigate**:
- Are queriers in a crash loop (e.g. `OOMKilled`)? See the `kubectl` sketch after this list.
  - `OOMKilled`: temporarily increase the queriers' memory request/limit
  - `panic`: look for the stack trace in the logs and investigate from there
- Has QPS increased?
  - Scale up queriers to satisfy the increased workload
- Has query latency increased?
  - Increased latency reduces the number of queries that can run per second: once all workers are busy, new queries pile up in the queue
  - Temporarily scale up queriers to try to stop the bleeding
- Check if a specific tenant is running heavy queries
  - Run `sum by (user) (cortex_query_scheduler_queue_length{namespace="<namespace>"}) > 0` to find tenants with enqueued queries
  - Check the `Cortex / Slow Queries` dashboard to find slow queries
  - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers disabled**, you may consider enabling it for that specific tenant to reduce its blast radius. To enable shuffle-sharding of queriers for a single tenant, set the `max_queriers_per_tenant` limit override for that tenant to the number of queriers that should be assigned to it (see the overrides sketch after this list).
  - On a multi-tenant Cortex cluster with **shuffle-sharding for queriers enabled**, you may consider temporarily increasing the shard size for the affected tenants: be aware that this could affect other tenants too, reducing the resources available to run their queries. Alternatively, you may choose to do nothing and let Cortex return errors for that tenant once its per-tenant queue is full.
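
To check whether queriers are in a crash loop (first bullet above), a few standard `kubectl` commands are usually enough. This is a sketch: the `name=querier` label selector is an assumption based on the cortex-jsonnet deployments, so adjust it (and the pod name) to match your environment.

```
# List querier pods and look at the RESTARTS column
kubectl --namespace <namespace> get pods -l name=querier

# Inspect the last state of a restarting pod: look for OOMKilled or a panic
kubectl --namespace <namespace> describe pod <querier-pod>

# Fetch the logs of the previous (crashed) container to find the stack trace
kubectl --namespace <namespace> logs <querier-pod> --previous
```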
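
For the shuffle-sharding options in the last two bullets, the per-tenant override is typically applied through the runtime configuration (overrides) file. The snippet below is a sketch: it assumes the standard Cortex `overrides` structure and uses a hypothetical tenant ID `tenant-a`; double-check the exact limit name against the limits documentation for your Cortex version.

```
overrides:
  tenant-a:
    # Assumed limit name: number of queriers this tenant's queries can be spread across
    max_queriers_per_tenant: 10
```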

### CortexCacheRequestErrors