
Commit c6b4464

Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

1 parent 6421751

File tree

1 file changed: +28 −2 lines changed

cortex-mixin/docs/playbooks.md

Lines changed: 28 additions & 2 deletions
@@ -90,7 +90,7 @@ How to **fix**:
 
 This alert fires when a specific Cortex route is experiencing a high latency.
 
-The alert message includes both the Cortex service and route experiencing the high latency. Establish if the alert is about the read or write path based on that.
+The alert message includes both the Cortex service and route experiencing the high latency. Establish if the alert is about the read or write path based on that (see [Cortex routes by path](#cortex-routes-by-path)).
 
 #### Write Latency
 
@@ -106,6 +106,9 @@ How to **investigate**:
   - Typically, distributor p99 latency is in the range 50-100ms. If the distributor latency is higher than this, you may need to scale up the distributors.
 - **`ingester`**
   - Typically, ingester p99 latency is in the range 5-50ms. If the ingester latency is higher than this, you should investigate the root cause before scaling up ingesters.
+  - Check out the following alerts and fix them if firing:
+    - `CortexProvisioningTooManyActiveSeries`
+    - `CortexProvisioningTooManyWrites`
 
 #### Read Latency
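
A quick way to inspect the per-route p99 latency discussed in this hunk is a PromQL query along these lines. This is only a sketch: it assumes the standard Cortex `cortex_request_duration_seconds` histogram is scraped and that the write-path routes match the regex shown; adjust metric, label, and route names to your deployment.

```
# p99 latency per write-path route over the last 5 minutes (route regex is an example)
histogram_quantile(
  0.99,
  sum by (le, route) (
    rate(cortex_request_duration_seconds_bucket{route=~"/distributor.Distributor/Push|/cortex.Ingester/Push"}[5m])
  )
)
```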

@@ -130,6 +133,7 @@ How to **investigate**:
 - High CPU utilization in ingesters
   - Scale up ingesters
 - Low cache hit ratio in the store-gateways
+  - Check `Memcached Overview` dashboard
   - If memcached eviction rate is high, then you should scale up memcached replicas. Check the recommendations by `Cortex / Scaling` dashboard and make reasonable adjustments as necessary.
   - If memcached eviction rate is zero or very low, then it may be caused by "first time" queries
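
To confirm the memcached behaviour mentioned above, a sketch of PromQL queries, assuming the memcached instances are scraped with the Prometheus memcached_exporter (these metric names come from that exporter and may differ in your setup):

```
# Eviction rate per memcached instance
rate(memcached_items_evicted_total[5m])

# Overall hit ratio of memcached get operations
  sum(rate(memcached_commands_total{command="get", status="hit"}[5m]))
/
  sum(rate(memcached_commands_total{command="get"}[5m]))
```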

@@ -140,7 +144,7 @@ This alert fires when the rate of 5xx errors of a specific route is > 1% for some time.
 This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one (but the investigation procedure is typically the same).
 
 How to **investigate**:
-- Check for which route the alert fired
+- Check for which route the alert fired (see [Cortex routes by path](#cortex-routes-by-path))
 - Write path: open the `Cortex / Writes` dashboard
 - Read path: open the `Cortex / Reads` dashboard
 - Looking at the dashboard you should see in which Cortex service the error originates
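
To see which route is serving 5xx errors, a PromQL sketch like the following can help, assuming the `cortex_request_duration_seconds` metric with `route` and `status_code` labels (adjust names to your deployment):

```
# Fraction of requests per route that returned a 5xx over the last 5 minutes
  sum by (route) (rate(cortex_request_duration_seconds_count{status_code=~"5.."}[5m]))
/
  sum by (route) (rate(cortex_request_duration_seconds_count[5m]))
```
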
@@ -588,6 +592,28 @@ This can be triggered if there are too many HA dedupe keys in etcd. We saw this
 },
 ```
 
+## Cortex routes by path
+
+**Write path**:
+- `/distributor.Distributor/Push`
+- `/cortex.Ingester/Push`
+- `api_v1_push`
+- `api_prom_push`
+- `api_v1_push_influx_write`
+
+**Read path**:
+- `/schedulerpb.SchedulerForFrontend/FrontendLoop`
+- `/cortex.Ingester/QueryStream`
+- `/cortex.Ingester/QueryExemplars`
+- `/gatewaypb.StoreGateway/Series`
+- `api_prom_label`
+- `api_prom_api_v1_query_exemplars`
+
+**Ruler / rules path**:
+- `api_v1_rules`
+- `api_v1_rules_namespace`
+- `api_prom_rules_namespace`
+
 ## Cortex blocks storage - What to do when things go wrong
 
 ## Recovering from a potential data loss incident
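
The route lists added in the new `Cortex routes by path` section can be plugged straight into queries. For example, a sketch of a selector restricted to the write-path routes, again assuming the `cortex_request_duration_seconds` metric with a `route` label:

```
# Total request rate across the write-path routes listed in "Cortex routes by path"
sum by (route) (
  rate(cortex_request_duration_seconds_count{route=~"/distributor.Distributor/Push|/cortex.Ingester/Push|api_v1_push|api_prom_push|api_v1_push_influx_write"}[5m])
)
```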
