Added playbook for CortexAllocatingTooMuchMemory

pracucci · pracucci · commit 1e56681df857 · 2021-07-02T12:20:19.000+02:00
Signed-off-by: Marco Pracucci <marco@pracucci.com>
diff --git a/cortex-mixin/alerts/alerts.libsonnet b/cortex-mixin/alerts/alerts.libsonnet
@@ -479,7 +479,7 @@
           },
           annotations: {
             message: |||
-              High QPS for ingesters, add more ingesters.
+              Ingesters in {{ $labels.namespace }} have a high samples/sec rate.
             |||,
           },
         },
@@ -498,7 +498,7 @@
           },
           annotations: {
             message: |||
-              Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
+              Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
             |||,
           },
         },
@@ -517,7 +517,7 @@
           },
           annotations: {
             message: |||
-              Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
+              Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
             |||,
           },
         },
diff --git a/cortex-mixin/docs/playbooks.md b/cortex-mixin/docs/playbooks.md
@@ -451,7 +451,25 @@ How to **fix**:
 
 ### CortexAllocatingTooMuchMemory
 
-_TODO: this playbook has not been written yet._
+This alert fires when an ingester's memory utilization is getting close to its limit.
+
+How it **works**:
+- Cortex ingesters are a stateful service
+- Having 2+ ingesters `OOMKilled` may cause a cluster outage
+- Ingester baseline memory usage is primarily influenced by memory allocated by the process (mostly Go heap) and mmap-ed files (used by TSDB)
+- Short spikes in ingester memory are primarily caused by queries
+- A pod gets `OOMKilled` once its working set memory reaches the configured limit, so it's important to prevent ingester memory utilization (working set memory) from getting close to the limit (we need to keep at least 30% room for spikes due to queries)
+
+How to **fix**:
+- Check if the issue occurs only for a few ingesters. If so:
+  - Restart the affected ingesters one at a time (proceed to the next one only once the previous pod has restarted and is Ready)
+    ```
+    kubectl -n <namespace> delete pod ingester-XXX
+    ```
+  - Restarting an ingester typically reduces the memory allocated by mmap-ed files. This memory may be allocated again over time, but the restart can buy you time while you work on a longer-term solution
+- Check the `Cortex / Writes Resources` dashboard to see if the number of series per ingester is above the target (1.5M). If so:
+  - Scale up ingesters
+  - Memory is expected to be reclaimed at the next TSDB head compaction (occurring every 2h)
 
 ### CortexGossipMembersMismatch
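
Note on the `CortexAllocatingTooMuchMemory` playbook added above: the two numeric rules it states (keep at least 30% room for query spikes, and target 1.5M series per ingester) can be sketched as a small calculation. This is only an illustrative sketch, not part of the commit. The function names are hypothetical, and it assumes that "30% room" means 30% of the configured memory limit and that the series count passed in is the total in-memory series summed across all ingesters (replication already included).

```python
# Illustrative sketch only; helper names are hypothetical, not Cortex APIs.

HEADROOM_PERCENT = 30                    # playbook: keep at least 30% room for query spikes
TARGET_SERIES_PER_INGESTER = 1_500_000   # playbook: target of 1.5M series per ingester


def has_enough_headroom(working_set_bytes: int, limit_bytes: int,
                        headroom_percent: int = HEADROOM_PERCENT) -> bool:
    """Return True if the free memory is at least headroom_percent of the limit.

    Assumption: the headroom is measured against the configured memory limit.
    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    free_bytes = limit_bytes - working_set_bytes
    return free_bytes * 100 >= headroom_percent * limit_bytes


def ingesters_needed(total_in_memory_series: int,
                     target_per_ingester: int = TARGET_SERIES_PER_INGESTER) -> int:
    """Return the smallest ingester count that keeps each ingester at or below the target.

    Assumption: series are spread evenly across ingesters.
    """
    if target_per_ingester <= 0:
        raise ValueError("target_per_ingester must be positive")
    # Ceiling division without floats; always keep at least one ingester.
    return max(1, -(-total_in_memory_series // target_per_ingester))


if __name__ == "__main__":
    gib = 1024 ** 3
    print(has_enough_headroom(6 * gib, 10 * gib))   # 40% free
    print(has_enough_headroom(8 * gib, 10 * gib))   # 20% free
    print(ingesters_needed(45_000_000))
```

An ingester that fails the headroom check is a candidate for the restart step, while a series count above the target points to scaling up, as the playbook describes.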