This repository was archived by the owner on Apr 28, 2025. It is now read-only.

Commit 3528572

Merge pull request #345 from grafana/playbook-for-CortexAllocatingTooMuchMemory
Added playbook for CortexAllocatingTooMuchMemory
2 parents 12293f0 + f8b162b commit 3528572

2 files changed, 22 additions and 4 deletions


cortex-mixin/alerts/alerts.libsonnet

Lines changed: 3 additions & 3 deletions
@@ -479,7 +479,7 @@
       },
       annotations: {
         message: |||
-          High QPS for ingesters, add more ingesters.
+          Ingesters in {{ $labels.namespace }} ingest too many samples per second.
         |||,
       },
     },
@@ -498,7 +498,7 @@
       },
       annotations: {
         message: |||
-          Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
+          Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
         |||,
       },
     },
@@ -517,7 +517,7 @@
       },
       annotations: {
         message: |||
-          Too much memory being used by {{ $labels.namespace }}/{{ $labels.pod }} - add more ingesters.
+          Ingester {{ $labels.namespace }}/{{ $labels.pod }} is using too much memory.
         |||,
       },
     },

cortex-mixin/docs/playbooks.md

Lines changed: 19 additions & 1 deletion
@@ -451,7 +451,25 @@ How to **fix**:
 
 ### CortexAllocatingTooMuchMemory
 
-_TODO: this playbook has not been written yet._
+This alert fires when an ingester's memory utilization is getting close to the limit.
+
+How it **works**:
+- Cortex ingesters are a stateful service
+- Having 2+ ingesters `OOMKilled` may cause a cluster outage
+- Ingester baseline memory usage is primarily influenced by memory allocated by the process (mostly Go heap) and mmap-ed files (used by TSDB)
+- Short spikes in ingester memory are primarily caused by queries and by TSDB head compaction into new blocks (occurring every 2h)
+- A pod gets `OOMKilled` once its working set memory reaches the configured limit, so it's important to prevent the ingesters' memory utilization (working set memory) from getting close to the limit (we need to keep at least 30% room for spikes due to queries)
+
+How to **fix**:
+- Check if the issue occurs only for a few ingesters. If so:
+  - Restart the affected ingesters 1 by 1 (proceed with the next one once the previous pod has restarted and is Ready)
+    ```
+    kubectl -n <namespace> delete pod ingester-XXX
+    ```
+  - Restarting an ingester typically reduces the memory allocated by mmap-ed files. After the restart, the ingester may allocate this memory again over time, but the restart may buy some time while working on a longer-term solution
+- Check the `Cortex / Writes Resources` dashboard to see if the number of series per ingester is above the target (1.5M). If so:
+  - Scale up ingesters
+  - Memory is expected to be reclaimed at the next TSDB head compaction (occurring every 2h)
 
 ### CortexGossipMembersMismatch
 
