Commit c6ab778

Merge pull request #101238 from gwynnemonahan/rebased-no-1.10-integration
Rebased no 1.10 integration
2 parents efa9971 + bf67644 commit c6ab778

28 files changed

+964
-199
lines changed

_topic_maps/_topic_map.yml

Lines changed: 8 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -3262,10 +3262,12 @@ Topics:
32623262
Dir: network_observability
32633263
Distros: openshift-enterprise,openshift-origin
32643264
Topics:
3265-
- Name: Network observability release notes 1.9.3
3266-
File: network-observability-release-notes-1-9-3
3267-
- Name: Network observability release notes 1.9.2
3268-
File: network-observability-release-notes-1-9-2
3265+
- Name: Network Observability Operator release notes 1.10
3266+
File: network-observability-operator-release-notes-1-10
3267+
#- Name: Network observability release notes 1.9.3
3268+
# File: network-observability-release-notes-1-9-3
3269+
#- Name: Network observability release notes 1.9.2
3270+
# File: network-observability-release-notes-1-9-2
32693271
- Name: Network observability release notes
32703272
File: network-observability-operator-release-notes
32713273
- Name: Network observability overview
@@ -3280,6 +3282,8 @@ Topics:
32803282
File: network-observability-network-policy
32813283
- Name: Observing the network traffic
32823284
File: observing-network-traffic
3285+
- Name: Network observability alerts
3286+
File: network-observability-alerts
32833287
- Name: Using metrics with dashboards and alerts
32843288
File: metrics-alerts-dashboards
32853289
- Name: Monitoring the Network Observability Operator
Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,152 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * network_observability/network-observability-alerts.adoc
4+
5+
:_mod-docs-content-type: REFERENCE
6+
[id="network-observability-alerts-about-promql-expression_{context}"]
7+
= About the PromQL expression for alerts
8+
9+
[role="_abstract"]
10+
Learn about the base Prometheus Query Language (`PromQL`) query and how to customize it so that you can configure network observability alerts for your specific needs.
11+
12+
The alerting API in the network observability `FlowCollector` custom resource (`CR`) is mapped to the Prometheus Operator API, generating a `PrometheusRule`. You can see the `PrometheusRule` in the default `netobserv` namespace by running the following command:
13+
14+
[source,terminal]
15+
----
16+
$ oc get prometheusrules -n netobserv -oyaml
17+
----
18+
19+
[id="example-example-query-alert-for-surge-in-incoming-traffic_{context}"]
20+
== An example query for an alert on a surge in incoming traffic
21+
22+
This example provides the base `PromQL` query pattern for an alert about a surge in incoming traffic:
23+
24+
[source,promql]
25+
----
26+
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
27+
----
28+
29+
This query calculates the byte rate coming from the `openshift-ingress` namespace to any of your workloads' namespaces over the past 30 minutes.
30+
31+
You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.
32+
33+
Filtering noise:: Appending `> 1000` to this query retains only the rates observed that are greater than `1 KB/s`, which eliminates noise from low-bandwidth consumers.
34+
+
35+
`(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)`
36+
+
37+
The byte rate is relative to the sampling interval defined in the `FlowCollector` custom resource (`CR`) configuration. If the sampling interval is `1:100`, the actual traffic might be approximately 100 times higher than the reported metrics.
38+
39+
Time comparison:: You can run the same query for a particular period of time using the `offset` modifier. For example, a query for one day earlier can be run using `offset 1d`, and a query for five hours ago can be run using `offset 5h`.
40+
+
41+
`sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)`
42+
+
43+
You can use the formula `100 * (<query now> - <query from the previous day>) / <query from the previous day>` to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.
44+
45+
Final threshold:: You can apply a final threshold to filter increases that are lower than the desired percentage. For example, `> 100` eliminates increases that are lower than 100%.
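
The three customizations above amount to simple arithmetic. The following Python sketch is illustrative only and does not query Prometheus. The function names (`surge_percent`, `should_alert`, `estimated_actual_rate`) and the sample byte rates are hypothetical. The constants mirror the `> 1000` noise filter and the final `> 100` threshold. The zero-baseline guard is an assumption of this sketch, added to avoid a Python `ZeroDivisionError`; PromQL follows floating-point rules and would return `+Inf` for a nonzero rate divided by zero.

```python
from typing import Optional

NOISE_FLOOR_BPS = 1000.0      # mirrors the `> 1000` filter (bytes per second)
SURGE_THRESHOLD_PCT = 100.0   # mirrors the final `> 100` threshold


def surge_percent(rate_now: float, rate_previous_day: float) -> Optional[float]:
    """Return the percentage change compared to the previous day.

    Mirrors: 100 * (<query now> - <query from the previous day>)
             / <query from the previous day>
    Returns None when the sample is filtered out.
    """
    if rate_now <= NOISE_FLOOR_BPS:
        return None  # removed by the noise filter
    if rate_previous_day <= 0:
        return None  # assumption of this sketch: skip a zero baseline
    return 100.0 * (rate_now - rate_previous_day) / rate_previous_day


def should_alert(rate_now: float, rate_previous_day: float) -> bool:
    """Apply the final threshold to the computed percentage."""
    pct = surge_percent(rate_now, rate_previous_day)
    return pct is not None and pct > SURGE_THRESHOLD_PCT


def estimated_actual_rate(reported_rate: float, sampling: int) -> float:
    """Approximate the real byte rate for a 1:<sampling> sampling interval."""
    return reported_rate * sampling


if __name__ == "__main__":
    # 3000 B/s today against 1000 B/s yesterday is a 200% increase.
    print(surge_percent(3000.0, 1000.0), should_alert(3000.0, 1000.0))
    # Traffic that halved produces a negative percentage and no alert.
    print(surge_percent(2000.0, 4000.0), should_alert(2000.0, 4000.0))
```

Because the noise filter is applied to the current rate only, a namespace whose traffic is below `1 KB/s` today never produces a result, regardless of how small yesterday's rate was.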
46+
47+
Together, the complete expression for the `PrometheusRule` looks like the following:
48+
49+
[source,promql]
50+
----
51+
...
52+
expr: |-
53+
(100 *
54+
(
55+
(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
56+
- sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
57+
)
58+
/ sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
59+
> 100
60+
----
61+
62+
[id="alert-metadata-fields_{context}"]
63+
== Alert metadata fields
64+
65+
The Network Observability Operator uses components from other {product-title} features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see: "Monitoring stack architecture".
66+
67+
Some metadata must be configured for the alert definitions. This metadata is used by Prometheus and the `Alertmanager` service from the monitoring stack, or by the *Network Health* dashboard.
68+
69+
The following example shows an `AlertingRule` resource with the configured metadata:
70+
71+
[source,yaml]
72+
----
73+
apiVersion: monitoring.openshift.io/v1
74+
kind: AlertingRule
75+
metadata:
76+
name: netobserv-alerts
77+
namespace: openshift-monitoring
78+
spec:
79+
groups:
80+
- name: NetObservAlerts
81+
rules:
82+
- alert: NetObservIncomingBandwidth
83+
annotations:
84+
netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
85+
message: |-
86+
NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
87+
summary: "Surge in incoming traffic"
88+
expr: |-
89+
(100 *
90+
(
91+
(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
92+
- sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
93+
)
94+
/ sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
95+
> 100
96+
for: 1m
97+
labels:
98+
app: netobserv
99+
netobserv: "true"
100+
severity: warning
101+
----
102+
103+
where:
104+
105+
`spec.groups.rules.labels.netobserv`::
106+
When set to `true`, allows the *Network Health* dashboard to detect the alert.
107+
`spec.groups.rules.labels.severity`::
108+
Specifies the severity of the alert. Valid values are `critical`, `warning`, and `info`.
109+
110+
You can use the output labels from the defined `PromQL` expression in the `message` annotation. In the example, because results are grouped per `DstK8S_Namespace`, the expression `{{ $labels.DstK8S_Namespace }}` is used in the message text.
111+
112+
The `netobserv_io_network_health` annotation is optional, and controls how the alert is rendered on the *Network Health* page.
113+
114+
The `netobserv_io_network_health` annotation is a JSON string consisting of the following fields:
115+
116+
.Fields for the netobserv_io_network_health annotation
117+
[cols="2,2,6",options="header"]
118+
|===
119+
| Field
120+
| Type
121+
| Description
122+
123+
| `namespaceLabels`
124+
| List of strings
125+
| One or more labels that hold namespaces. When provided, the alert appears under the *Namespaces* tab.
126+
127+
| `nodeLabels`
128+
| List of strings
129+
| One or more labels that hold node names. When provided, the alert appears under the *Nodes* tab.
130+
131+
| `threshold`
132+
| String
133+
| The alert threshold, expected to match the threshold defined in the `PromQL` expression.
134+
135+
| `unit`
136+
| String
137+
| The data unit, used only for display purposes.
138+
139+
| `upperBound`
140+
| String
141+
| An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped.
142+
143+
| `links`
144+
| List of objects
145+
| A list of links to display contextually with the alert. Each link requires a `name` (display name) and `url`.
146+
147+
| `trafficLinkFilter`
148+
| String
149+
| An additional filter to inject into the URL for the *Network Traffic* page.
150+
|===
151+
152+
The `namespaceLabels` and `nodeLabels` are mutually exclusive. If neither is provided, the alert appears under the *Global* tab.
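
Because the annotation value is a JSON string embedded in YAML, generating it programmatically can help avoid quoting mistakes. The following Python sketch is a hypothetical helper, not part of the Network Observability Operator. The function names `build_network_health_annotation` and `dashboard_tab` are assumptions. The sketch covers only a subset of the fields in the table, enforces the mutual exclusivity of `namespaceLabels` and `nodeLabels`, and reports which tab the alert would appear under according to the rules described above.

```python
import json
from typing import Dict, List, Optional


def build_network_health_annotation(
    threshold: str,
    unit: str,
    upper_bound: str,
    namespace_labels: Optional[List[str]] = None,
    node_labels: Optional[List[str]] = None,
) -> str:
    """Serialize a subset of the annotation fields to a compact JSON string."""
    if namespace_labels and node_labels:
        raise ValueError("namespaceLabels and nodeLabels are mutually exclusive")
    fields: Dict[str, object] = {}
    if namespace_labels:
        fields["namespaceLabels"] = namespace_labels
    if node_labels:
        fields["nodeLabels"] = node_labels
    fields["threshold"] = threshold
    fields["unit"] = unit
    fields["upperBound"] = upper_bound
    # Compact separators keep the value on one line inside the YAML annotation.
    return json.dumps(fields, separators=(",", ":"))


def dashboard_tab(annotation: str) -> str:
    """Return the Network Health tab implied by the annotation fields."""
    fields = json.loads(annotation)
    if fields.get("namespaceLabels"):
        return "Namespaces"
    if fields.get("nodeLabels"):
        return "Nodes"
    return "Global"


if __name__ == "__main__":
    value = build_network_health_annotation(
        "100", "%", "500", namespace_labels=["DstK8S_Namespace"]
    )
    print(value)
    print(dashboard_tab(value))
```

With the arguments shown, the helper reproduces the annotation value used in the `AlertingRule` example.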
Lines changed: 60 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,60 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * network_observability/network-observability-alerts.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="network-observability-alerts-about_{context}"]
7+
= About network observability alerts
8+
9+
[role="_abstract"]
10+
Network observability includes predefined alerts. Use these alerts to gain insight into the health and performance of your {product-title} applications and infrastructure.
11+
12+
The predefined alerts provide a quick health indication of your cluster's network in the *Network Health* dashboard. You can also customize alerts using Prometheus Query Language (PromQL) queries.
13+
14+
By default, network observability creates alerts that are contextual to the features you enable.
15+
16+
For example, packet drop-related alerts are created only if the `PacketDrop` agent feature is enabled in the `FlowCollector` custom resource (CR). Alerts are built on metrics, and you might see configuration warnings if enabled alerts are missing their required metrics.
17+
18+
You can configure these metrics in the `spec.processor.metrics.includeList` object of the `FlowCollector` CR.
19+
20+
[id="network-observability-default-alert-templates_{context}"]
21+
== List of default alert templates
22+
23+
These alert templates are installed by default:
24+
25+
`PacketDropsByDevice`:: Triggers on high percentage of packet drops from devices (`/proc/net/dev`).
26+
`PacketDropsByKernel`:: Triggers on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature.
27+
`IPsecErrors`:: Triggers when IPsec encryption errors are detected by network observability; it requires the `IPSec` agent feature.
28+
`NetpolDenied`:: Triggers when traffic denied by network policies is detected by network observability; it requires the `NetworkEvents` agent feature.
29+
`LatencyHighTrend`:: Triggers when an increase of TCP latency is detected by network observability; it requires the `FlowRTT` agent feature.
30+
`DNSErrors`:: Triggers when DNS errors are detected by network observability; it requires the `DNSTracking` agent feature.
31+
//* `ExternalEgressHighTrend`: TODO.
32+
//* `ExternalIngressHighTrend`: TODO.
33+
34+
These are operational alerts that relate to the self-health of network observability:
35+
36+
`NetObservNoFlows`:: Triggers when no flows are being observed for a certain period.
37+
`NetObservLokiError`:: Triggers when flows are being dropped due to Loki errors.
38+
39+
You can configure, extend, or disable alerts for network observability. You can view the resulting `PrometheusRule` resource in the default `netobserv` namespace by running the following command:
40+
41+
[source,terminal]
42+
----
43+
$ oc get prometheusrules -n netobserv -oyaml
44+
----
45+
46+
[id="network-health-dashboard_{context}"]
47+
== Network Health dashboard
48+
49+
When alerts are enabled in the Network Observability Operator, two things happen:
50+
51+
* New alerts appear in *Observe* → *Alerting* → *Alerting rules* tab in the {product-title} web console.
52+
* A new *Network Health* dashboard appears in {product-title} web console → *Observe*.
53+
54+
The *Network Health* dashboard provides a summary of triggered alerts and pending alerts, distinguishing between critical, warning, and minor issues. Alerts for rule violations are displayed in the following tabs:
55+
56+
* *Global*: Shows alerts that are global to the cluster.
57+
* *Nodes*: Shows alerts for rule violations per node.
58+
* *Namespaces*: Shows alerts for rule violations per namespace.
59+
60+
Click a resource card to see more information. Next to each alert, a three-dot menu appears. From this menu, you can navigate to *Network Traffic* → *Traffic flows* to see more detailed information for the selected resource.

modules/network-observability-con_filter-network-flows-at-ingestion.adoc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,6 @@ spec:
7878
sampling: 10 <2>
7979
----
8080
<1> Sends matching flows to a specific output, such as Loki, Prometheus, or an external system. When omitted, sends to all configured outputs.
81-
<2> Optional. Applies a sampling ratio to limit the number of matching flows to be stored or exported. For example, `sampling: 10` means 1/10 of the flows are kept.
81+
<2> Optional. Applies a sampling interval to limit the number of matching flows to be stored or exported. For example, `sampling: 10` means that there is a 1 in 10 chance that a flow will be kept.
8282

8383

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
// Module included in the following assemblies:
2+
//
3+
// network_observability/network-observability-alerts.adoc
4+
5+
:_mod-docs-content-type: CONCEPT
6+
[id="network-observability-configuring-predefined-alerts_{context}"]
7+
= Configuring predefined alerts
8+
9+
[role="_abstract"]
10+
Alerts in the Network Observability Operator are defined using alert templates and variants in the `spec.processor.metrics.alerts` object of the `FlowCollector` custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.
11+
12+
After you enable alerts, the *Network Health* dashboard appears in the *Observe* section of the {product-title} web console.
13+
14+
For each template, you can define a list of variants, each with their own thresholds and grouping configurations. For more information, see the "List of default alert templates".
15+
16+
Here is an example:
17+
18+
[source,yaml,subs="attributes,verbatim"]
19+
----
20+
apiVersion: flows.netobserv.io/v1beta1
21+
kind: FlowCollector
22+
metadata:
23+
name: flow-collector
24+
spec:
25+
processor:
26+
metrics:
27+
alerts:
28+
- template: PacketDropsByKernel
29+
variants:
30+
# triggered when the whole cluster traffic (no grouping) reaches 10% of drops
31+
- thresholds:
32+
critical: "10"
33+
# triggered when per-node traffic reaches 5% of drops, with gradual severity
34+
- thresholds:
35+
critical: "15"
36+
warning: "10"
37+
info: "5"
38+
groupBy: Node
39+
----
40+
41+
[NOTE]
42+
====
43+
Customizing an alert replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them.
44+
====
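
To reason about how the gradual thresholds in a variant relate to an observed value, consider the following Python sketch. It is an illustration only and makes two assumptions that the source does not confirm: that a threshold counts as met when the observed percentage is greater than or equal to it, and that only the most severe matching level is reported. The function name `severity_for` is hypothetical, and the rules that the Operator actually generates might behave differently.

```python
from typing import Dict, Optional

# Most severe level first, so the first match wins.
SEVERITY_ORDER = ("critical", "warning", "info")


def severity_for(drop_percent: float, thresholds: Dict[str, str]) -> Optional[str]:
    """Return the most severe level whose threshold the value reaches, if any.

    Thresholds are strings, as in the FlowCollector example.
    """
    for severity in SEVERITY_ORDER:
        raw = thresholds.get(severity)
        if raw is not None and drop_percent >= float(raw):
            return severity
    return None


if __name__ == "__main__":
    per_node = {"critical": "15", "warning": "10", "info": "5"}
    cluster_wide = {"critical": "10"}
    print(severity_for(12.0, per_node))      # falls between warning and critical
    print(severity_for(9.0, cluster_wide))   # below the only threshold
```

Under these assumptions, a node with 12% of drops maps to the `warning` level of the per-node variant, while cluster-wide drops of 9% stay below the single `critical` threshold.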
Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * network_observability/network-observability-alerts.adoc
4+
5+
:_mod-docs-content-type: PROCEDURE
6+
[id="network-observability-creating-custom-alert-rules_{context}"]
7+
= Creating custom alert rules
8+
9+
[role="_abstract"]
10+
Use the Prometheus Query Language (`PromQL`) to define a custom `AlertingRule` resource that triggers alerts based on specific network metrics, such as traffic surges.
11+
12+
.Prerequisites
13+
14+
* You are familiar with `PromQL`.
15+
* You have installed {product-title} 4.14 or later.
16+
* You have access to the cluster as a user with the `cluster-admin` role.
17+
* You have installed the Network Observability Operator.
18+
19+
.Procedure
20+
21+
. Create a YAML file named `custom-alert.yaml` that contains your `AlertingRule` resource.
22+
. Apply the custom alert rule by running the following command:
23+
+
24+
[source,terminal]
25+
----
26+
$ oc apply -f custom-alert.yaml
27+
----
28+
29+
.Verification
30+
31+
. Verify that the `PrometheusRule` resource was created in the `netobserv` namespace by running the following command:
32+
+
33+
[source,terminal]
34+
----
35+
$ oc get prometheusrules -n netobserv -oyaml
36+
----
37+
+
38+
The output should include the `netobserv-alerts` rule you just created, confirming that the resource was generated correctly.
39+
40+
. Confirm the rule is active by checking the *Network Health* dashboard in the {product-title} web console → *Observe*.

modules/network-observability-deploy-network-policy.adoc

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -5,11 +5,14 @@
55

66
:_mod-docs-content-type: PROCEDURE
77
[id="network-observability-deploy-network-policy_{context}"]
8-
= Configuring an ingress network policy by using the FlowCollector custom resource
8+
= Configuring network policy by using the FlowCollector custom resource
99

10-
You can configure the `FlowCollector` custom resource (CR) to deploy an ingress network policy for network observability by setting the `spec.NetworkPolicy.enable` specification to `true`. By default, the specification is `false`.
10+
[role="_abstract"]
11+
You can set up ingress and egress network policies to control pod traffic. Doing so enhances security and limits collection to the network flow data you need, which reduces noise, supports compliance, and improves visibility into network communication.
1112

12-
If you have installed Loki, Kafka or any exporter in a different namespace that also has a network policy, you must ensure that the Network Observability components can communicate with them. Consider the following about your setup:
13+
You can configure the `FlowCollector` custom resource (CR) to deploy egress and ingress network policies for network observability. By default, the `spec.networkPolicy.enable` specification is set to `true`.
14+
15+
If you have installed Loki, Kafka or any exporter in a different namespace that also has a network policy, you must ensure that the network observability components can communicate with them. Consider the following about your setup:
1316

1417
* Connection to Loki (as defined in the `FlowCollector` CR `spec.loki` parameter)
1518
* Connection to Kafka (as defined in the `FlowCollector` CR `spec.kafka` parameter)
@@ -33,9 +36,9 @@ metadata:
3336
spec:
3437
namespace: netobserv
3538
networkPolicy:
36-
enable: true <1>
39+
enable: true <1>
3740
additionalNamespaces: ["openshift-console", "openshift-monitoring"] <2>
3841
# ...
3942
----
40-
<1> By default, the `enable` value is `false`.
43+
<1> By default, the `enable` value is `true`.
4144
<2> Default values are `["openshift-console", "openshift-monitoring"]`.
Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
// Module included in the following assemblies:
2+
//
3+
// * network_observability/network-observability-alerts.adoc
4+
5+
:_mod-docs-content-type: REFERENCE
6+
[id="network-observability-disabling-predefined-alerts_{context}"]
7+
= Disabling predefined alerts
8+
9+
[role="_abstract"]
10+
You can disable alert templates in the `spec.processor.metrics.disableAlerts` field of the `FlowCollector` custom resource (CR). This setting accepts a list of alert template names. For a list of alert template names, see "List of default alert templates".
11+
12+
If a template is disabled and overridden in the `spec.processor.metrics.alerts` field, the disable setting takes precedence and the alert rule is not created.
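
The precedence rule can be sketched in a few lines of Python. This is a simplified model, not the Operator's implementation. The function name `resolve_templates` is hypothetical, the template names are taken from the list of default alert templates, and the sketch ignores the agent features that some templates require.

```python
from typing import Dict, Iterable, List

# Template names from the list of default alert templates.
DEFAULT_TEMPLATES: List[str] = [
    "PacketDropsByDevice",
    "PacketDropsByKernel",
    "IPsecErrors",
    "NetpolDenied",
    "LatencyHighTrend",
    "DNSErrors",
]


def resolve_templates(overridden: Iterable[str], disabled: Iterable[str]) -> Dict[str, str]:
    """Map each template that still produces rules to "default" or "custom".

    `overridden` models the templates listed under spec.processor.metrics.alerts.
    `disabled` models spec.processor.metrics.disableAlerts.
    """
    overridden_set = set(overridden)
    disabled_set = set(disabled)
    resolved: Dict[str, str] = {}
    for name in DEFAULT_TEMPLATES:
        if name in disabled_set:
            continue  # disabling wins, even if the template is also overridden
        resolved[name] = "custom" if name in overridden_set else "default"
    return resolved


if __name__ == "__main__":
    result = resolve_templates(
        overridden=["PacketDropsByKernel", "DNSErrors"],
        disabled=["DNSErrors"],
    )
    for name, source in result.items():
        print(f"{name}: {source}")
```

In this model, `DNSErrors` produces no rule even though it is also overridden, which matches the precedence described above.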
