This repository was archived by the owner on Apr 28, 2025. It is now read-only.
cortex-mixin/docs/playbooks.md
28 additions & 2 deletions
@@ -90,7 +90,7 @@ How to **fix**:
This alert fires when a specific Cortex route is experiencing high latency.
-The alert message includes both the Cortex service and route experiencing the high latency. Establish if the alert is about the read or write path based on that.
+The alert message includes both the Cortex service and route experiencing the high latency. Establish if the alert is about the read or write path based on that (see [Cortex routes by path](#cortex-routes-by-path)).
#### Write Latency
@@ -106,6 +106,9 @@ How to **investigate**:
- Typically, distributor p99 latency is in the range 50-100ms. If the distributor latency is higher than this, you may need to scale up the distributors.
- **`ingester`**
- Typically, ingester p99 latency is in the range 5-50ms. If the ingester latency is higher than this, you should investigate the root cause before scaling up ingesters.
+- Check out the following alerts and fix them if firing:
+  - `CortexProvisioningTooManyActiveSeries`
+  - `CortexProvisioningTooManyWrites`
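
The write-path guidance above can be read as a small triage rule. The following is a minimal sketch, assuming only the typical p99 ranges quoted in the playbook; the names `TYPICAL_P99_MS` and `triage_write_latency` are hypothetical and are not part of Cortex or the mixin.

```python
# Typical p99 latency ranges quoted in the playbook, in milliseconds.
# The dictionary and function names are illustrative assumptions.
TYPICAL_P99_MS = {
    "distributor": (50, 100),
    "ingester": (5, 50),
}


def triage_write_latency(service: str, p99_ms: float) -> str:
    """Return the playbook's suggested next step for a write-path service."""
    _low, high = TYPICAL_P99_MS[service]
    if p99_ms <= high:
        return "within typical range"
    if service == "distributor":
        # The playbook suggests scaling may be needed for distributors.
        return "consider scaling up distributors"
    # For ingesters the playbook asks for a root-cause investigation first,
    # including the CortexProvisioningTooMany* alerts.
    return "investigate root cause before scaling up ingesters"
```

For example, `triage_write_latency("ingester", 80)` points to a root-cause investigation rather than to scaling, which mirrors the asymmetry between the two services in the text.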
#### Read Latency
@@ -130,6 +133,7 @@ How to **investigate**:
- High CPU utilization in ingesters
- Scale up ingesters
- Low cache hit ratio in the store-gateways
+- Check `Memcached Overview` dashboard
- If memcached eviction rate is high, then you should scale up memcached replicas. Check the recommendations by `Cortex / Scaling` dashboard and make reasonable adjustments as necessary.
- If memcached eviction rate is zero or very low, then it may be caused by "first time" queries
@@ -140,7 +144,7 @@ This alert fires when the rate of 5xx errors of a specific route is > 1% for som
This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one (but the investigation procedure is typically the same).
How to **investigate**:
-- Check for which route the alert fired
+- Check for which route the alert fired (see [Cortex routes by path](#cortex-routes-by-path))
- Write path: open the `Cortex / Writes` dashboard
- Read path: open the `Cortex / Reads` dashboard
- Looking at the dashboard you should see in which Cortex service the error originates
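
The alert condition described in this hunk (5xx errors above 1% for a route) can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the per-status request counts are made-up inputs, the function names are hypothetical, and the "for some time" duration part of the alert is not modelled.

```python
from typing import Mapping


def error_ratio(status_counts: Mapping[int, int]) -> float:
    """Fraction of requests for one route that returned a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return errors / total


def exceeds_error_threshold(status_counts: Mapping[int, int],
                            threshold: float = 0.01) -> bool:
    """True when the 5xx ratio is strictly above the threshold (1% by default)."""
    return error_ratio(status_counts) > threshold
```

With hypothetical counts of 980 responses with status 200 and 20 with status 500, the ratio is 0.02, so the check is true; at exactly 1% it is false because the comparison is strict.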
@@ -588,6 +592,28 @@ This can be triggered if there are too many HA dedupe keys in etcd. We saw this