This repository was archived by the owner on Apr 28, 2025. It is now read-only.
cortex-mixin/docs/playbooks.md
Lines changed: 19 additions & 1 deletion
@@ -451,7 +451,25 @@ How to **fix**:
 
 ### CortexAllocatingTooMuchMemory
 
-_TODO: this playbook has not been written yet._
+This alert fires when an ingester's memory utilization is getting close to its limit.
+
+How it **works**:
+- Cortex ingesters are a stateful service
+- Having 2+ ingesters `OOMKilled` may cause a cluster outage
+- Baseline ingester memory usage is primarily driven by memory allocated by the process (mostly Go heap) and mmap-ed files (used by TSDB)
+- Short spikes in ingester memory usage are primarily caused by queries
+- A pod gets `OOMKilled` once its working set memory reaches the configured limit, so it's important to keep ingester memory utilization (working set memory) from getting close to the limit (we need to keep at least 30% headroom for spikes due to queries)
+
+How to **fix**:
+- Check if the issue occurs only for a few ingesters. If so:
+  - Restart the affected ingesters one by one (proceed with the next one once the previous pod has restarted and is Ready)
+    ```
+    kubectl -n <namespace> delete pod ingester-XXX
+    ```
+  - Restarting an ingester typically reduces the memory allocated by mmap-ed files. That memory may be allocated again over time, but the restart can buy you time while you work on a longer-term solution
+- Check the `Cortex / Writes Resources` dashboard to see if the number of series per ingester is above the target (1.5M). If so:
+  - Scale up ingesters
+  - Memory is expected to be reclaimed at the next TSDB head compaction (occurring every 2h)
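
The "restart one by one, and only move on once the previous pod is Ready" step could be scripted roughly as follows. This is a minimal sketch, not part of the playbook. It assumes the ingesters are managed by a StatefulSet, so a deleted pod comes back under the same name. The function name `restart_one_by_one`, the `KUBECTL` override variable, the 5-second poll interval, and the 30-minute timeout are all illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch only: restart the given ingester pods strictly one at a time.
# KUBECTL is a hypothetical override hook (handy for dry runs); it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

restart_one_by_one() {
  local namespace="$1"
  shift
  local pod
  for pod in "$@"; do
    echo "Restarting ${pod}"
    # Delete the pod; its controller is expected to recreate it (assumption: StatefulSet).
    "$KUBECTL" -n "$namespace" delete pod "$pod" || return 1
    # The replacement pod may not exist for a short while, so poll until it appears.
    until "$KUBECTL" -n "$namespace" get pod "$pod" >/dev/null 2>&1; do
      sleep 5
    done
    # Block until the pod reports Ready before moving on to the next one.
    # The 30m timeout is an arbitrary value; tune it to your ingester startup time.
    "$KUBECTL" -n "$namespace" wait --for=condition=Ready "pod/${pod}" --timeout=30m || return 1
  done
}
```

A call such as `restart_one_by_one <namespace> ingester-XXX ingester-YYY` would then process the pods in the order given and stop at the first failure, so that a pod that never becomes Ready does not lead to a second ingester being taken down.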
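
The two numeric rules of thumb in the playbook (keep at least 30% memory headroom, and stay at or below 1.5M series per ingester) can be checked with simple integer arithmetic. The helpers below are a hedged sketch: the function names are hypothetical, the inputs are assumed to be read by hand from the dashboards (working set and limit in bytes, and the sum of in-memory series across all ingesters), and `ingesters_needed` assumes series are spread evenly across ingesters.

```shell
#!/usr/bin/env bash
# Sketch only: arithmetic helpers for the playbook's thresholds.

# Print working-set memory as an integer percentage of the limit (rounded down).
memory_utilization_pct() {
  local working_set_bytes="$1" limit_bytes="$2"
  echo $(( working_set_bytes * 100 / limit_bytes ))
}

# Succeed if at least min_headroom_pct (default 30) of the limit is still free.
has_headroom() {
  local working_set_bytes="$1" limit_bytes="$2" min_headroom_pct="${3:-30}"
  [ $(( working_set_bytes * 100 )) -le $(( limit_bytes * (100 - min_headroom_pct) )) ]
}

# Print how many ingesters are needed to stay at or below the per-ingester
# series target (default 1.5M), given the total in-memory series summed over
# all ingesters. Uses integer ceiling division.
ingesters_needed() {
  local total_series="$1" target_per_ingester="${2:-1500000}"
  echo $(( (total_series + target_per_ingester - 1) / target_per_ingester ))
}
```

For example, a 9 GiB working set against a 12 GiB limit is 75% utilization, which leaves only 25% headroom and fails the `has_headroom` check; 40M in-memory series at the 1.5M target works out to 27 ingesters.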