Commit c3a56b5

Merge pull request #99888 from lcavalle/TELCODOCS-2171-observability
TELCODOCS-2171#Generalize Day2Ops Observability
2 parents 079a63d + eaf79a6 commit c3a56b5

8 files changed: +64 -61 lines changed

_topic_maps/_topic_map.yml

Lines changed: 1 addition & 1 deletion
Original file line number / Diff line number / Diff line change
@@ -3624,7 +3624,7 @@ Topics:
36243624
Dir: observability
36253625
Topics:
36263626
- Name: Observability in OpenShift Container Platform
3627-
File: telco-observability
3627+
File: observability
36283628
- Name: Security
36293629
Dir: security
36303630
Topics:

edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc renamed to edge_computing/day_2_core_cnf_clusters/observability/observability.adoc

Lines changed: 8 additions & 8 deletions
@@ -1,8 +1,8 @@
11
:_mod-docs-content-type: ASSEMBLY
2-
[id="telco-observability"]
3-
= Observability in telco core CNF clusters
2+
[id="observability"]
3+
= Observability in {product-title} clusters
44
include::_attributes/common-attributes.adoc[]
5-
:context: telco-observability
5+
:context: observability
66
:imagesdir: images
77

88
toc::[]
@@ -13,7 +13,7 @@ What follows is an outline of best practices for system engineers, architects, a
1313

1414
Unless explicitly stated, the material in this document refers to both Edge and Core deployments.
1515

16-
include::modules/telco-observability-monitoring-stack.adoc[leveloffset=+1]
16+
include::modules/observability-monitoring-stack.adoc[leveloffset=+1]
1717

1818
[role="_additional-resources"]
1919
.Additional resources
@@ -22,7 +22,7 @@ include::modules/telco-observability-monitoring-stack.adoc[leveloffset=+1]
2222
2323
* xref:../../../observability/monitoring/getting-started/core-platform-monitoring-first-steps.adoc#core-platform-monitoring-first-steps[Core platform monitoring first steps]
2424
25-
include::modules/telco-observability-key-performance-metrics.adoc[leveloffset=+1]
25+
include::modules/observability-key-performance-metrics.adoc[leveloffset=+1]
2626

2727
[role="_additional-resources"]
2828
.Additional resources
@@ -31,16 +31,16 @@ include::modules/telco-observability-key-performance-metrics.adoc[leveloffset=+1
3131
3232
* xref:../../../storage/persistent_storage_local/persistent-storage-local.adoc#local-storage-install_persistent-storage-local[Persistent storage using local volumes]
3333
34-
include::modules/telco-observability-monitoring-the-edge.adoc[leveloffset=+1]
34+
include::modules/observability-monitoring-the-edge.adoc[leveloffset=+1]
3535

36-
include::modules/telco-observability-alerting.adoc[leveloffset=+1]
36+
include::modules/observability-alerting.adoc[leveloffset=+1]
3737

3838
[role="_additional-resources"]
3939
.Additional resources
4040

4141
* xref:../../../observability/monitoring/about-ocp-monitoring/key-concepts.adoc#about-managing-alerts_key-concepts[Managing alerts]
4242
43-
include::modules/telco-observability-workload-monitoring.adoc[leveloffset=+1]
43+
include::modules/observability-workload-monitoring.adoc[leveloffset=+1]
4444

4545
[role="_additional-resources"]
4646
.Additional resources

edge_computing/day_2_core_cnf_clusters/telco-day-2-welcome.adoc

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Troubleshooting and maintaining telco core CNF clusters:: To maintain and troubl
1515

1616
Observability in telco core CNF clusters:: {product-title} generates a large amount of data, such as performance metrics and logs from the platform and the workloads running on it.
1717
As an administrator, you can use tools to collect and analyze the available data.
18-
For more information, see xref:../day_2_core_cnf_clusters/observability/telco-observability.adoc#telco-observability[Observability in telco core CNF clusters].
18+
For more information, see xref:../day_2_core_cnf_clusters/observability/observability.adoc#observability[Observability in {product-title}].
1919

2020
Security:: You can enhance security for high-bandwidth network deployments in telco environments by following key security considerations.
2121
For more information, see xref:../day_2_core_cnf_clusters/security/telco-security-basics.adoc#telco-security-basics[Security basics].

modules/telco-observability-alerting.adoc renamed to modules/observability-alerting.adoc

Lines changed: 13 additions & 12 deletions
@@ -1,30 +1,30 @@
11
// Module included in the following assemblies:
22
//
3-
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc
3+
// * edge_computing/day_2_core_cnf_clusters/observability/observability.adoc
44

55
:_mod-docs-content-type: PROCEDURE
6-
[id="telco-observability-alerting_{context}"]
6+
[id="observability-alerting_{context}"]
77

88
= Alerting
99

1010
{product-title} includes a large number of alert rules, which can change from release to release.
1111

12-
[id="viewing-default-alerts"]
12+
[id="viewing-default-alerts_{context}"]
1313
== Viewing default alerts
1414

15-
Use the following procedure to review all of the alert rules in a cluster.
15+
Review all of the alert rules in a cluster.
1616

1717
.Procedure
1818

19-
* To review all the alert rules in a cluster, you can run the following command:
19+
* To review all the alert rules in a cluster, run the following command:
2020
[source,terminal]
2121
+
2222
----
2323
$ oc get cm -n openshift-monitoring prometheus-k8s-rulefiles-0 -o yaml
2424
----
2525
+
2626
Rules can include a description and provide a link to additional information and mitigation steps.
27-
For example, this is the rule for `etcdHighFsyncDurations`:
27+
For example, see the rule for `etcdHighFsyncDurations`:
2828
+
2929
[source,terminal]
3030
----
@@ -43,11 +43,12 @@ For example, this is the rule for `etcdHighFsyncDurations`:
4343
----
4444

4545
[id="alert-notifications"]
46-
== Alert notifications
47-
You can view alerts in the {product-title} console, however an administrator should configure an external receiver to forward the alerts to.
46+
== Alert notifications
47+
48+
You can view alerts in the {product-title} console. However, an administrator must configure an external receiver to forward the alerts to.
4849
{product-title} supports the following receiver types:
4950

50-
* PagerDuty: a 3rd party incident response platform
51-
* Webhook: an arbitrary API endpoint that receives an alert via a POST request and can take any necessary action
52-
* Email: sends an email to designated address
53-
* Slack: sends a notification to either a slack channel or an individual user
51+
PagerDuty:: A third-party incident response platform.
52+
Webhook:: An arbitrary API endpoint that receives an alert through a `POST` request and can take any necessary action.
53+
Email:: Sends an email to a designated address.
54+
Slack:: Sends a notification to either a Slack channel or an individual user.
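
To illustrate the Webhook receiver type listed above: Alertmanager delivers a batch of alerts as a JSON document in the body of the `POST` request. The following Python sketch shows how a receiving endpoint might summarize that body. The function name `summarize_alerts` and the sample body are hypothetical. The field names (`status`, `alerts`, `labels`, `annotations`) follow the Alertmanager webhook payload format, and real payloads carry more fields than the sample includes.

```python
import json


def summarize_alerts(body: str) -> list[str]:
    """Return one summary line per alert in an Alertmanager webhook body."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "none")
        status = alert.get("status", "unknown")
        # Fall back to the description when a rule has no summary annotation.
        text = annotations.get("summary") or annotations.get("description", "")
        lines.append(f"[{status}] {name} ({severity}): {text}")
    return lines


# Hypothetical sample body, trimmed to the fields that the function reads.
SAMPLE_BODY = json.dumps(
    {
        "version": "4",
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "etcdHighFsyncDurations", "severity": "warning"},
                "annotations": {"summary": "etcd fsync durations are too high"},
            }
        ],
    }
)

for line in summarize_alerts(SAMPLE_BODY):
    print(line)
```

A real receiver would call a function like this from its HTTP handler and then forward the result to a ticketing or chat system.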

modules/telco-observability-key-performance-metrics.adoc renamed to modules/observability-key-performance-metrics.adoc

Lines changed: 21 additions & 19 deletions
@@ -1,14 +1,14 @@
11
// Module included in the following assemblies:
22
//
3-
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc
3+
// * edge_computing/day_2_core_cnf_clusters/observability/observability.adoc
44

55
:_mod-docs-content-type: CONCEPT
6-
[id="telco-observability-key-performance-metrics_{context}"]
6+
[id="observability-key-performance-metrics_{context}"]
77
= Key performance metrics
88

9-
Depending on your system, there can be hundreds of available measurements.
9+
Depending on your system, you can have hundreds of available measurements.
1010

11-
Here are some key metrics that you should pay attention to:
11+
Consider the following key metrics:
1212

1313
* `etcd` response times
1414
* API response times
@@ -17,26 +17,30 @@ Here are some key metrics that you should pay attention to:
1717
* OVN health
1818
* Overall cluster operator health
1919
20-
A good rule to follow is that if you decide that a metric is important, there should be an alert for it.
20+
If a metric is important, set up an alert for it.
2121

2222
[NOTE]
2323
====
2424
You can check the available metrics by running the following command:
25+
26+
+
2527
[source,terminal]
2628
----
2729
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -qsk http://localhost:9090/api/v1/metadata | jq '.data'
2830
----
2931
====
3032

31-
[id="example-queries-promql"]
33+
[id="example-queries-promql_{context}"]
3234
== Example queries in PromQL
3335

34-
The following tables show some queries that you can explore in the metrics query browser using the {product-title} console.
36+
Using the {product-title} console, you can explore the following queries in the metrics query browser.
3537

3638
[NOTE]
3739
====
3840
The URL for the console is https://<OpenShift Console FQDN>/monitoring/query-browser.
39-
You can get the OpenShift Console FQDN by running the following command:
41+
You can get the OpenShift Console FQDN by running the following command:
42+
43+
+
4044
[source,terminal]
4145
----
4246
$ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].host}'
@@ -79,7 +83,7 @@ $ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].ho
7983
|`POST`
8084
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver\|openshift-apiserver", verb="POST"}[60m])))`
8185

82-
|`LIST`
86+
|`LIST`
8387
|`histogram_quantile (0.99, sum by (le,managed_cluster) (sum_over_time(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver\|openshift-apiserver", verb="LIST"}[60m])))`
8488

8589
|`PUT`
@@ -127,17 +131,15 @@ $ oc get routes -n openshift-console console -o jsonpath='{.status.ingress[0].ho
127131

128132
|===
129133

130-
[id="recommendations-for-storage-of-metrics"]
134+
[id="recommendations-for-storage-of-metrics_{context}"]
131135
== Recommendations for storage of metrics
132136

133-
Out of the box, Prometheus does not back up saved metrics with persistent storage.
134-
If you restart the Prometheus pods, all metrics data are lost.
135-
You should configure the monitoring stack to use the back-end storage that is available on the platform.
136-
To meet the high IO demands of Prometheus you should use local storage.
137-
138-
For Telco core clusters, you can use the Local Storage Operator for persistent storage for Prometheus.
137+
By default, Prometheus does not back up saved metrics with persistent storage.
138+
If you restart the Prometheus pods, all metrics data are lost.
139+
You must configure the monitoring stack to use the back-end storage that is available on the platform.
140+
To meet the high IO demands of Prometheus, use local storage.
139141

140-
{odf-first}, which deploys a ceph cluster for block, file, and object storage, is also a suitable candidate for a Telco core cluster.
142+
For smaller clusters, you can use the Local Storage Operator for persistent storage for Prometheus. {odf-first}, which deploys a Ceph cluster for block, file, and object storage, is suitable for larger clusters.
141143

142-
To keep system resource requirements low on a RAN {sno} or far edge cluster, you should not provision backend storage for the monitoring stack.
143-
Such clusters forward all metrics to the hub cluster where you can provision a third party monitoring platform.
144+
To keep system resource requirements low on a {sno} cluster, do not provision back-end storage for the monitoring stack.
145+
Such clusters forward all metrics to the hub cluster where you can provision a third party monitoring platform.
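
The example PromQL queries in this module rely on `histogram_quantile`, which estimates a quantile from cumulative histogram buckets. The following Python sketch is a simplified illustration of that estimate: it finds the bucket that contains the target rank and interpolates linearly inside it. It is not the Prometheus implementation and ignores edge cases such as negative bucket bounds. The function name `estimate_quantile` and the sample bucket counts are hypothetical.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs."""
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative request-duration buckets, in seconds.
SAMPLE_BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

print(estimate_quantile(0.99, SAMPLE_BUCKETS))
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies that you care about.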

modules/telco-observability-monitoring-stack.adoc renamed to modules/observability-monitoring-stack.adoc

Lines changed: 3 additions & 3 deletions
@@ -1,9 +1,9 @@
11
// Module included in the following assemblies:
22
//
3-
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc
3+
// * edge_computing/day_2_core_cnf_clusters/observability/observability.adoc
44

55
:_mod-docs-content-type: CONCEPT
6-
[id="telco-observability-monitoring-stack_{context}"]
6+
[id="observability-monitoring-stack_{context}"]
77
= Understanding the monitoring stack
88

99
The monitoring stack uses the following components:
@@ -17,5 +17,5 @@ image::monitoring-architecture.png[{product-title} monitoring architecture]
1717

1818
[NOTE]
1919
====
20-
For a {sno} cluster, you should disable Alertmanager and Thanos because the cluster sends all metrics to the hub cluster for analysis and retention.
20+
For {sno} clusters, disable Alertmanager and Thanos because the clusters send all metrics to the hub cluster for analysis and retention.
2121
====

modules/telco-observability-monitoring-the-edge.adoc renamed to modules/observability-monitoring-the-edge.adoc

Lines changed: 13 additions & 13 deletions
@@ -1,14 +1,14 @@
11
// Module included in the following assemblies:
22
//
3-
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc
3+
// * edge_computing/day_2_core_cnf_clusters/observability/observability.adoc
44

55
:_mod-docs-content-type: PROCEDURE
6-
[id="telco-observability-monitoring-the-edge_{context}"]
6+
[id="observability-monitoring-the-edge_{context}"]
77

8-
= Monitoring the edge
8+
= Monitoring at the far edge network
99

10-
{sno-caps} at the edge keeps the footprint of the platform components to a minimum.
11-
The following procedure is an example of how you can configure a {sno} node with a small monitoring footprint.
10+
{product-title} clusters at the edge must keep the footprint of the platform components to a minimum.
11+
The following procedure is an example of how to configure a {sno} or a node at the far edge network with a small monitoring footprint.
1212

1313
.Prerequisites
1414

@@ -36,14 +36,14 @@ metadata:
3636
retention: 24h
3737
----
3838

39-
. On the {sno}, apply the `ConfigMap` CR by running the following command:
39+
. Apply the `ConfigMap` CR by running the following command on the {sno} cluster:
4040
+
4141
[source,terminal]
4242
----
4343
$ oc apply -f monitoringConfigMap.yaml
4444
----
4545

46-
. Create a `NameSpace` CR, and save it as `monitoringNamespace.yaml`, as in the following example:
46+
. Create a `Namespace` CR, and save it as `monitoringNamespace.yaml`, as in the following example:
4747
+
4848
[source,yaml]
4949
----
@@ -53,7 +53,7 @@ metadata:
5353
name: open-cluster-management-observability
5454
----
5555

56-
. On the hub cluster, apply the `Namespace` CR on the hub cluster by running the following command:
56+
. Apply the `Namespace` CR by running the following command on the hub cluster:
5757
+
5858
[source,terminal]
5959
----
@@ -75,7 +75,7 @@ spec:
7575
generateBucketName: acm-multi
7676
----
7777

78-
. On the hub cluster, apply the `ObjectBucketClaim` CR, by running the following command:
78+
. Apply the `ObjectBucketClaim` CR by running the following command on the hub cluster:
7979
+
8080
[source,terminal]
8181
----
@@ -95,14 +95,14 @@ stringData:
9595
.dockerconfigjson: 'PULL_SECRET'
9696
----
9797

98-
. On the hub cluster, apply the `Secret` CR by running the following command:
98+
. Apply the `Secret` CR by running the following command on the hub cluster:
9999
+
100100
[source,terminal]
101101
----
102102
$ oc apply -f monitoringSecret.yaml
103103
----
104104

105-
. Get the keys for the NooBaa service and the backend bucket name from the hub cluster by running the following commands:
105+
. Get the keys for the NooBaa service and the back-end bucket name from the hub cluster by running the following commands:
106106
+
107107
[source,terminal]
108108
----
@@ -140,7 +140,7 @@ stringData:
140140
secret_key: ${NOOBAA_SECRET_KEY}
141141
----
142142

143-
. On the hub cluster, apply the `Secret` CR by running the following command:
143+
. Apply the `Secret` CR by running the following command on the hub cluster:
144144
+
145145
[source,terminal]
146146
----
@@ -177,7 +177,7 @@ spec:
177177
storeStorageSize: 25Gi
178178
----
179179

180-
. On the hub cluster, apply the `MultiClusterObservability` CR by running the following command:
180+
. Apply the `MultiClusterObservability` CR by running the following command on the hub cluster:
181181
+
182182
[source,terminal]
183183
----

modules/telco-observability-workload-monitoring.adoc renamed to modules/observability-workload-monitoring.adoc

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
11
// Module included in the following assemblies:
22
//
3-
// * edge_computing/day_2_core_cnf_clusters/observability/telco-observability.adoc
3+
// * edge_computing/day_2_core_cnf_clusters/observability/observability.adoc
44

55
:_mod-docs-content-type: PROCEDURE
6-
[id="telco-observability-workload-monitoring_{context}"]
6+
[id="observability-workload-monitoring_{context}"]
77
= Workload monitoring
88

99
By default, {product-title} does not collect metrics for application workloads. You can configure a cluster to collect workload metrics.
@@ -67,8 +67,8 @@ spec:
6767
$ oc apply -f monitoringServiceMonitor.yaml
6868
----
6969

70-
Prometheus scrapes the path `/metrics` by default, however you can define a custom path.
71-
It is up to the vendor of the application to expose this endpoint for scraping, with metrics that they deem relevant.
70+
Prometheus scrapes the `/metrics` path by default. However, you can define a custom path.
71+
The vendor of the application must decide whether to expose the endpoint for scraping, with metrics that they deem relevant.
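
As an illustration of the `/metrics` endpoint described above, the following Python sketch exposes a counter in the Prometheus text exposition format by using only the standard library. The metric name `demo_requests_total`, the port, and the helper names are hypothetical. A production workload would more likely use a Prometheus client library than hand-written rendering.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(counters: dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application counters that the workload would update.
COUNTERS = {"demo_requests_total": 0}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Serve only the default scrape path; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve /metrics until the process is interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A `ServiceMonitor` that targets the service port can then scrape it at the default path.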
7272

7373
== Creating a workload alert
7474
