|
| 1 | +// Module included in the following assemblies: |
| 2 | +// |
| 3 | +// * network_observability/network-observability-alerts.adoc |
| 4 | + |
| 5 | +:_mod-docs-content-type: REFERENCE |
| 6 | +[id="network-observability-alerts-about-promql-expression_{context}"] |
| 7 | += About the PromQL expression for alerts |
| 8 | + |
| 9 | +[role="_abstract"] |
| 10 | +Learn about the base Prometheus Query Language (`PromQL`) query and how to customize it so that you can configure network observability alerts for your specific needs. |
| 11 | + |
| 12 | +The alerting API in the network observability `FlowCollector` custom resource (`CR`) is mapped to the Prometheus Operator API, generating a `PrometheusRule`. You can see the `PrometheusRule` in the default `netobserv` namespace by running the following command: |
| 13 | + |
| 14 | +[source,terminal] |
| 15 | +---- |
| 16 | +$ oc get prometheusrules -n netobserv -o yaml |
| 17 | +---- |
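
To review only the alert names and expressions, one option is to request JSON output (`-o json`) instead and walk the `items[].spec.groups[].rules[]` structure of the returned `PrometheusRule` resources. The following Python sketch assumes that the JSON output is available as a string. The function name `list_alert_rules` and the sample data are hypothetical and are used only for illustration.

```python
import json

def list_alert_rules(rule_list_json: str) -> list[tuple[str, str]]:
    """Return (alert name, PromQL expression) pairs from a PrometheusRule list."""
    doc = json.loads(rule_list_json)
    pairs = []
    for item in doc.get("items", []):
        for group in item.get("spec", {}).get("groups", []):
            for rule in group.get("rules", []):
                # Recording rules use "record" instead of "alert"; skip them.
                if "alert" in rule:
                    pairs.append((rule["alert"], rule["expr"]))
    return pairs

# Hypothetical sample standing in for the output of:
#   oc get prometheusrules -n netobserv -o json
sample = json.dumps({
    "items": [{
        "spec": {"groups": [{
            "name": "NetObservAlerts",
            "rules": [
                {"alert": "NetObservIncomingBandwidth", "expr": "vector(1) > 0"},
                {"record": "example:recorded", "expr": "vector(1)"},
            ],
        }]},
    }],
})

for name, expr in list_alert_rules(sample):
    print(f"{name}: {expr}")
```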
| 18 | + |
| 19 | +[id="example-example-query-alert-for-surge-in-incoming-traffic_{context}"] |
| 20 | +== An example query for an alert about a surge in incoming traffic |
| 21 | + |
| 22 | +This example provides the base `PromQL` query pattern for an alert about a surge in incoming traffic: |
| 23 | + |
| 24 | +[source,promql] |
| 25 | +---- |
| 26 | +sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) |
| 27 | +---- |
| 28 | + |
| 29 | +This query calculates the per-second byte rate, averaged over the past 30 minutes, of traffic coming from the `openshift-ingress` namespace to each of your workload namespaces. |
| 30 | + |
| 31 | +You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold. |
| 32 | + |
| 33 | +Filtering noise:: Appending `> 1000` to this query retains only the rates observed that are greater than `1 KB/s`, which eliminates noise from low-bandwidth consumers. |
| 34 | ++ |
| 35 | +`(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)` |
| 36 | ++ |
| 37 | +The byte rate is relative to the sampling interval defined in the `FlowCollector` custom resource (`CR`) configuration. If the sampling interval is `1:100`, the actual traffic might be approximately 100 times higher than the reported metrics. |
| 38 | + |
| 39 | +Time comparison:: You can run the same query for an earlier point in time by using the `offset` modifier. For example, a query for one day earlier can be run using `offset 1d`, and a query for five hours ago can be run using `offset 5h`. |
| 40 | ++ |
| 41 | +`sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)` |
| 42 | ++ |
| 43 | +You can use the formula `100 * (<query now> - <query from the previous day>) / <query from the previous day>` to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day. |
| 44 | + |
| 45 | +Final threshold:: You can apply a final threshold to filter out increases that are lower than the desired percentage. For example, `> 100` eliminates increases that are lower than 100%. |
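
The sampling note under "Filtering noise" can be made concrete with a small calculation. The following Python sketch assumes that a sampling value of `N` means that roughly 1 in `N` packets is captured, so that the reported rate is scaled back up by multiplying by `N`. The function name `estimate_actual_rate` is hypothetical, and the result is only an approximation.

```python
def estimate_actual_rate(reported_bytes_per_s: float, sampling: int) -> float:
    """Scale a sampled byte rate back up, assuming 1 in `sampling` packets is captured."""
    if sampling < 1:
        raise ValueError("sampling must be >= 1")
    return reported_bytes_per_s * sampling

# A reported 1000 B/s at 1:100 sampling suggests roughly 100000 B/s of real traffic.
print(estimate_actual_rate(1000, 100))
```

Under the same assumption, a `> 1000` noise filter applied to metrics sampled at `1:100` corresponds to approximately 100 KB/s of actual traffic, so you might need to adjust the filter value when you change the sampling configuration.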
| 46 | + |
| 47 | +Together, the complete expression for the `PrometheusRule` looks like the following: |
| 48 | + |
| 49 | +[source,promql] |
| 50 | +---- |
| 51 | +... |
| 52 | +      expr: |- |
| 53 | +        (100 * |
| 54 | +          ( |
| 55 | +            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000) |
| 56 | +            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace) |
| 57 | +          ) |
| 58 | +          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)) |
| 59 | +        > 100 |
| 60 | +---- |
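
To see how the noise filter, the day-over-day comparison, and the final threshold combine, the following Python sketch mirrors the complete expression on plain dictionaries. It is an illustration only, not how Prometheus evaluates the rule: the function name `surge_alerts` and all sample rates are hypothetical, and each dictionary stands in for the result of one `sum(rate(...)) by (DstK8S_Namespace)` query.

```python
import math

def surge_alerts(now, day_ago, min_rate=1000.0, threshold_pct=100.0):
    """Return {namespace: percent increase} for namespaces that would fire.

    `now` and `day_ago` map a destination namespace to a byte rate (bytes/s).
    """
    firing = {}
    for namespace, current in now.items():
        # "> 1000" filter: low-bandwidth namespaces drop out of the result.
        if current <= min_rate:
            continue
        # Binary operators only match series that are present on both sides.
        if namespace not in day_ago:
            continue
        previous = day_ago[namespace]
        if previous == 0:
            # Float division by zero: PromQL yields +Inf for a positive numerator.
            increase = math.inf
        else:
            increase = 100 * (current - previous) / previous
        # Final "> 100" threshold.
        if increase > threshold_pct:
            firing[namespace] = increase
    return firing

# Hypothetical per-namespace rates, in bytes per second.
now = {"shop": 5000.0, "blog": 1500.0, "batch": 400.0, "new-app": 9000.0}
day_ago = {"shop": 2000.0, "blog": 1200.0, "batch": 100.0}

print(surge_alerts(now, day_ago))
```

With these made-up numbers, only `shop` fires, with a 150% increase. The `blog` namespace increased by 25%, which is below the final threshold. The `batch` namespace quadrupled but stays under the noise filter, and `new-app` has no data from the previous day to compare against.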
| 61 | + |
| 62 | +[id="alert-metadata-fields_{context}"] |
| 63 | +== Alert metadata fields |
| 64 | + |
| 65 | +The Network Observability Operator uses components from other {product-title} features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see "Monitoring stack architecture". |
| 66 | + |
| 67 | +Some metadata must be configured for the alert definitions. This metadata is used by Prometheus and the `Alertmanager` service from the monitoring stack, or by the *Network Health* dashboard. |
| 68 | + |
| 69 | +The following example shows an `AlertingRule` resource with the configured metadata: |
| 70 | + |
| 71 | +[source,yaml] |
| 72 | +---- |
| 73 | +apiVersion: monitoring.openshift.io/v1 |
| 74 | +kind: AlertingRule |
| 75 | +metadata: |
| 76 | +  name: netobserv-alerts |
| 77 | +  namespace: openshift-monitoring |
| 78 | +spec: |
| 79 | +  groups: |
| 80 | +  - name: NetObservAlerts |
| 81 | +    rules: |
| 82 | +    - alert: NetObservIncomingBandwidth |
| 83 | +      annotations: |
| 84 | +        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}' |
| 85 | +        message: |- |
| 86 | +          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday. |
| 87 | +        summary: "Surge in incoming traffic" |
| 88 | +      expr: |- |
| 89 | +        (100 * |
| 90 | +          ( |
| 91 | +            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000) |
| 92 | +            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace) |
| 93 | +          ) |
| 94 | +          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)) |
| 95 | +        > 100 |
| 96 | +      for: 1m |
| 97 | +      labels: |
| 98 | +        app: netobserv |
| 99 | +        netobserv: "true" |
| 100 | +        severity: warning |
| 101 | +---- |
| 102 | + |
| 103 | +where: |
| 104 | + |
| 105 | +`spec.groups.rules.labels.netobserv`:: |
| 106 | +When set to `true`, specifies that the *Network Health* dashboard detects the alert. |
| 107 | +`spec.groups.rules.labels.severity`:: |
| 108 | +Specifies the severity of the alert. The following values are valid: `critical`, `warning`, and `info`. |
| 109 | + |
| 110 | +You can use the output labels from the defined `PromQL` expression in the `message` annotation. In the example, because results are grouped by `DstK8S_Namespace`, the expression `{{ $labels.DstK8S_Namespace }}` is used in the message text. |
| 111 | + |
| 112 | +The `netobserv_io_network_health` annotation is optional and controls how the alert is rendered on the *Network Health* page. |
| 113 | + |
| 114 | +The value of this annotation is a JSON string that consists of the following fields: |
| 115 | + |
| 116 | +.Fields for the netobserv_io_network_health annotation |
| 117 | +[cols="2,2,6",options="header"] |
| 118 | +|=== |
| 119 | +| Field |
| 120 | +| Type |
| 121 | +| Description |
| 122 | + |
| 123 | +| `namespaceLabels` |
| 124 | +| List of strings |
| 125 | +| One or more labels that hold namespaces. When provided, the alert appears under the *Namespaces* tab. |
| 126 | + |
| 127 | +| `nodeLabels` |
| 128 | +| List of strings |
| 129 | +| One or more labels that hold node names. When provided, the alert appears under the *Nodes* tab. |
| 130 | + |
| 131 | +| `threshold` |
| 132 | +| String |
| 133 | +| The alert threshold, expected to match the threshold defined in the `PromQL` expression. |
| 134 | + |
| 135 | +| `unit` |
| 136 | +| String |
| 137 | +| The data unit, used only for display purposes. |
| 138 | + |
| 139 | +| `upperBound` |
| 140 | +| String |
| 141 | +| An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped. |
| 142 | + |
| 143 | +| `links` |
| 144 | +| List of objects |
| 145 | +| A list of links to display contextually with the alert. Each link requires a `name` (display name) and `url`. |
| 146 | + |
| 147 | +| `trafficLinkFilter` |
| 148 | +| String |
| 149 | +| An additional filter to inject into the URL for the *Network Traffic* page. |
| 150 | +|=== |
| 151 | + |
| 152 | +The `namespaceLabels` and `nodeLabels` fields are mutually exclusive. If neither is provided, the alert appears under the *Global* tab. |
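
Because the annotation value is a JSON string embedded in YAML, it can be easier to generate it than to write it by hand. The following Python sketch builds the value from the fields in the table and rejects the mutually exclusive combination. The helper name `network_health_annotation` and its parameters are hypothetical, and the optional `links` and `trafficLinkFilter` fields are omitted for brevity.

```python
import json

def network_health_annotation(threshold, unit, upper_bound,
                              namespace_labels=None, node_labels=None):
    """Build the JSON string for the netobserv_io_network_health annotation."""
    if namespace_labels and node_labels:
        raise ValueError("namespaceLabels and nodeLabels are mutually exclusive")
    fields = {}
    if namespace_labels:
        fields["namespaceLabels"] = list(namespace_labels)
    if node_labels:
        fields["nodeLabels"] = list(node_labels)
    # The table lists these fields as strings, so numbers are converted.
    fields["threshold"] = str(threshold)
    fields["unit"] = unit
    fields["upperBound"] = str(upper_bound)
    # Compact separators keep the value on a single line.
    return json.dumps(fields, separators=(",", ":"))

print(network_health_annotation(100, "%", 500,
                                namespace_labels=["DstK8S_Namespace"]))
```

Called with these arguments, the helper reproduces the annotation value used in the `AlertingRule` example. Remember to keep `threshold` in sync with the threshold in the `PromQL` expression.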