From 232fad03663e082da7b948e7d93b7a3350a4c225 Mon Sep 17 00:00:00 2001 From: Vincent Mercier Date: Mon, 1 Sep 2025 13:46:30 +0200 Subject: [PATCH 1/4] refact(content): Move Storage increase commands to shortcode --- content/runbooks/rds/RDSDiskSpaceLimit.md | 44 ++----------------- .../aws-rds-storage-increase-commands.md | 35 +++++++++++++++ 2 files changed, 39 insertions(+), 40 deletions(-) create mode 100644 layouts/shortcodes/aws-rds-storage-increase-commands.md diff --git a/content/runbooks/rds/RDSDiskSpaceLimit.md b/content/runbooks/rds/RDSDiskSpaceLimit.md index 282755e..1c1eba8 100644 --- a/content/runbooks/rds/RDSDiskSpaceLimit.md +++ b/content/runbooks/rds/RDSDiskSpaceLimit.md @@ -60,51 +60,15 @@ Determine whether it's a long-term growth trend requiring storage increase or ab ## Mitigation -You must avoid reaching no disk space left situation. +Increase RDS disk space -- Fix the system that blocks PostgreSQL to recycle its WAL files - - - If long-running transactions/queries: Cancel or kill the transactions - - If non-running replication slot: Delete replication slot - -- Increase RDS disk space - - {{< hint danger >}} +{{< hint danger >}} {{% aws-rds-storage-increase-limitations %}} {{< /hint >}} - 1. Set AWS_PROFILE - - ```bash - export AWS_PROFILE= - ``` - - 2. Determine the minimum storage for the increase - 💡 RDS requires a minimal storage increase of 10% - - ```bash - INSTANCE_IDENTIFIER= - ``` - - ```bash - aws rds describe-db-instances --db-instance-identifier ${INSTANCE_IDENTIFIER} \ - | jq -r '{"Current IOPS": .DBInstances[0].Iops, "Current Storage Limit": .DBInstances[0].AllocatedStorage, "New minimum storage size": ((.DBInstances[0].AllocatedStorage|tonumber)+(.DBInstances[0].AllocatedStorage|tonumber*0.1|floor))}' - ``` - - 3. 
Increase storage: - - ```bash - NEW_ALLOCATED_STORAGE= - ``` - - ```bash - aws rds modify-db-instance --db-instance-identifier ${INSTANCE_IDENTIFIER} --allocated-storage ${NEW_ALLOCATED_STORAGE} --apply-immediately \ - | jq .DBInstance.PendingModifiedValues - ``` - - ❗ If the RDS instance has replicas instances (replica or reporting), you must repeat the operation for all replicas to keep the same configuration between instances +{{% aws-rds-storage-increase-commands %}} - 4. Backport changes in Terraform +1. Backport changes in Terraform ## Additional resources diff --git a/layouts/shortcodes/aws-rds-storage-increase-commands.md b/layouts/shortcodes/aws-rds-storage-increase-commands.md new file mode 100644 index 0000000..b927a15 --- /dev/null +++ b/layouts/shortcodes/aws-rds-storage-increase-commands.md @@ -0,0 +1,35 @@ + + +1. Set AWS_PROFILE + + ```bash + export AWS_PROFILE= + ``` + +2. Determine the minimum storage for the increase + + 💡 RDS requires a minimum storage increase of 10% + + ```bash + INSTANCE_IDENTIFIER= + ``` + + ```bash + aws rds describe-db-instances --db-instance-identifier ${INSTANCE_IDENTIFIER} \ + | jq -r '{"Current IOPS": .DBInstances[0].Iops, "Current Storage Limit": .DBInstances[0].AllocatedStorage, "New minimum storage size": ((.DBInstances[0].AllocatedStorage|tonumber)+(.DBInstances[0].AllocatedStorage|tonumber*0.1|floor))}' + ``` + +3. Increase storage: + + ```bash + NEW_ALLOCATED_STORAGE= + ``` + + ```bash + aws rds modify-db-instance --db-instance-identifier ${INSTANCE_IDENTIFIER} --allocated-storage ${NEW_ALLOCATED_STORAGE} --apply-immediately \ + | jq .DBInstance.PendingModifiedValues + ``` + + The instance quickly moves to the `modifying` status, then to `storage-optimization`.
+ + ❗ If the RDS instance has replica instances, you must repeat the operation for each replica to keep the same configuration between instances From c1c75a3df2c80ec7c34a1dd29efa71b69fcddbc0 Mon Sep 17 00:00:00 2001 From: Vincent Mercier Date: Mon, 1 Sep 2025 13:47:05 +0200 Subject: [PATCH 2/4] chore(shortcode): Add command to list RDS events --- layouts/shortcodes/aws-rds-list-events.md | 9 +++++++++ 1 file changed, 9 insertions(+) create mode 100644 layouts/shortcodes/aws-rds-list-events.md diff --git a/layouts/shortcodes/aws-rds-list-events.md b/layouts/shortcodes/aws-rds-list-events.md new file mode 100644 index 0000000..c0a5bf6 --- /dev/null +++ b/layouts/shortcodes/aws-rds-list-events.md @@ -0,0 +1,9 @@ + + +```bash +aws rds describe-events \ +--source-identifier ${INSTANCE_IDENTIFIER} \ +--duration 720 \ +--source-type db-instance \ +| jq -r '.Events[] | "\(.Date) [\(.EventCategories[0])] \(.Message)"' +``` From 64d00fb733babd96f5b8d34617401eb85f90b7bb Mon Sep 17 00:00:00 2001 From: Vincent Mercier Date: Mon, 1 Sep 2025 13:47:29 +0200 Subject: [PATCH 3/4] chore(shortcode): Add storage optimization --- layouts/shortcodes/aws-rds-status-storage-optimization.md | 5 +++++ 1 file changed, 5 insertions(+) create mode 100644 layouts/shortcodes/aws-rds-status-storage-optimization.md diff --git a/layouts/shortcodes/aws-rds-status-storage-optimization.md b/layouts/shortcodes/aws-rds-status-storage-optimization.md new file mode 100644 index 0000000..2aacb10 --- /dev/null +++ b/layouts/shortcodes/aws-rds-status-storage-optimization.md @@ -0,0 +1,5 @@ + + +Storage optimization can take several hours depending on the instance storage type. + +During this process, the instance performance may be impacted, and further storage modifications are deferred until optimization is complete.
From e1aeb89ca9e7eebc3736135d13719ceda4be9e64 Mon Sep 17 00:00:00 2001 From: Vincent Mercier Date: Mon, 1 Sep 2025 14:01:05 +0200 Subject: [PATCH 4/4] feat(runbook): Add RDSFullDiskSpace alert --- charts/prometheus-rds-alerts/values.yaml | 11 ++++- content/runbooks/rds/RDSDiskSpaceLimit.md | 9 ++++ content/runbooks/rds/RDSFullDiskSpace.md | 60 +++++++++++++++++++++++ 3 files changed, 79 insertions(+), 1 deletion(-) create mode 100644 content/runbooks/rds/RDSFullDiskSpace.md diff --git a/charts/prometheus-rds-alerts/values.yaml b/charts/prometheus-rds-alerts/values.yaml index ebdba60..e311f8f 100644 --- a/charts/prometheus-rds-alerts/values.yaml +++ b/charts/prometheus-rds-alerts/values.yaml @@ -45,7 +45,7 @@ rules: severity: warning annotations: summary: "Less than 20% free disk space on at least one instance" - description: 'One or more RDS instances has <20% free disk space' + description: "One or more RDS instances has <20% free disk space" RDSDiskSpaceLimit: expr: max by (aws_account_id, aws_region, dbidentifier) (rds_free_storage_bytes{} * 100 / rds_allocated_storage_bytes{}) < 10 @@ -204,3 +204,12 @@ rules: annotations: summary: "RDS instance(s) use(s) a certificate with an expiration date inferior to 15 days" description: "{{ $value }} instance(s) of the AWS account ID={{ $labels.aws_account_id}} in region={{ $labels.aws_region }} use(s) a certificate with an expiration date inferior to 15 days" + + RDSFullDiskSpace: + expr: max by (aws_account_id, aws_region, dbidentifier) (rds_instance_status{}) == -7 + for: 5m + labels: + severity: critical + annotations: + summary: "Instance storage is full" + description: "{{ $labels.dbidentifier }} storage is full" diff --git a/content/runbooks/rds/RDSDiskSpaceLimit.md b/content/runbooks/rds/RDSDiskSpaceLimit.md index 1c1eba8..738fe0b 100644 --- a/content/runbooks/rds/RDSDiskSpaceLimit.md +++ b/content/runbooks/rds/RDSDiskSpaceLimit.md @@ -60,8 +60,17 @@ Determine whether it's a long-term growth trend requiring storage 
increase or ab ## Mitigation +You must avoid running out of disk space. Increase RDS disk space +- Fix the system that prevents PostgreSQL from recycling its WAL files +{{< hint danger >}} + + - If long-running transactions/queries: Cancel or kill the transactions + - If non-running replication slot: Delete replication slot + +- Increase RDS disk space + {{< hint danger >}} {{% aws-rds-storage-increase-limitations %}} {{< /hint >}} diff --git a/content/runbooks/rds/RDSFullDiskSpace.md b/content/runbooks/rds/RDSFullDiskSpace.md new file mode 100644 index 0000000..550e073 --- /dev/null +++ b/content/runbooks/rds/RDSFullDiskSpace.md @@ -0,0 +1,60 @@ +--- +title: Full disk space +--- + +# RDSFullDiskSpace + +## Meaning + +The alert is triggered when the RDS instance storage is full + +## Impact + +PostgreSQL automatically stops when it detects there is no more disk space available. + +**All database accesses are blocked**, causing application errors + +## Diagnosis + +You need to increase the RDS storage. Determine whether it's a long-term growth trend requiring storage increase or abnormal disk usage reflecting another problem. + +{{< hint danger >}} +Since RDS disk space cannot be reduced and storage modifications are limited to once every 6 hours, you should carefully evaluate your storage requirements before making changes. +{{< /hint >}} + +## Mitigation + +The RDS instance is **no longer reachable**; you **must increase the RDS allocated storage**. + +{{< hint danger >}} +{{% aws-rds-storage-increase-limitations %}} +{{< /hint >}} + +{{% aws-rds-storage-increase-commands %}} + +1. Wait for the instance to enter the `storage-optimization` status + + The instance becomes accessible after the `modifying` operation is complete.
+ + {{}} + {{% aws-rds-status-storage-optimization %}} + {{< /hint >}} + + Check the RDS instance status: + + ```bash + aws rds describe-db-instances \ + --db-instance-identifier ${INSTANCE_IDENTIFIER} \ + --query "DBInstances[0].[DBInstanceStatus]" + ``` + + Additionally, you can follow the RDS events for this instance: + + {{% aws-rds-list-events %}} + +1. Backport changes in Terraform + +## Additional resources + +- [RDS Storage Modification](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html#USER_PIOPS.ModifyingExisting) +- [AWS RDS Storage Autoscaling](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIOPS.StorageTypes.html#USER_PIOPS.Autoscaling)
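
Note on the "New minimum storage size" computation in `aws-rds-storage-increase-commands`: the jq filter adds `floor(AllocatedStorage * 0.1)`, so for sizes that are not a multiple of 10 the suggested value can fall slightly below a full 10% increase (105 GiB gives 115, which is about 9.5%). The sketch below rounds the increment up instead. It assumes the 10% minimum is enforced strictly and that sizes are whole GiB; `min_new_storage` is a hypothetical helper name, not something defined in this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the runbook): smallest allocated storage,
# in GiB, that is at least 10% larger than the current allocated storage.
min_new_storage() {
  local current="$1"
  # (current + 9) / 10 is the integer ceiling of current * 0.1
  echo $(( current + (current + 9) / 10 ))
}

min_new_storage 100   # prints 110
min_new_storage 105   # prints 116 (the floor-based filter would give 115)
```

The result could be assigned to `NEW_ALLOCATED_STORAGE` before running `aws rds modify-db-instance`, keeping in mind that storage cannot be reduced afterwards.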