diff --git a/content/integrate/redis-data-integration/observability.md b/content/integrate/redis-data-integration/observability.md index 5235950c5f..e3d7be797d 100644 --- a/content/integrate/redis-data-integration/observability.md +++ b/content/integrate/redis-data-integration/observability.md @@ -25,8 +25,12 @@ to query the metrics and plot simple graphs or with [Grafana](https://grafana.com/) to produce more complex visualizations and dashboards. -RDI exposes two endpoints, one for [CDC collector metrics](#collector-metrics) and -another for [stream processor metrics](#stream-processor-metrics). +RDI exposes metrics for two main components: + - [CDC collector](#collector-metrics) + - [Stream processor](#stream-processor-metrics) + - [Operator](#operator-metrics) + + The sections below explain these sets of metrics in more detail. See the [architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}}) @@ -37,9 +41,67 @@ RDI metrics with the RDI monitoring screen in Redis Insight or with the [`redis-di status`]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-status" >}}) command from the CLI.{{< /note >}} -## Collector metrics +## Accessing the metrics + +The way you access the metrics endpoints depends on whether you are using a VM installation or a Helm installation for RDI. The sections below describe the correct approach for each installation type. + +### VM Installation + +For VM installations, the metrics are available by default on the following endpoints: +- Collector metrics: `https:///collector-source/metrics` +- Stream processor metrics: `https:///metrics` +- Operator metrics: `https:///operator/metrics` + +Please note that for RDI versions prior to 1.16.0 the collector metrics are not accessible. + +### Helm installation + +For Helm installations, the metrics are available via autodiscovery in the K8s cluster. Follow the steps below to use them: +1. Make sure you have the Prometheus Operator installed in your K8s cluster (see the + [Prometheus Operator installation guide](https://prometheus-operator.dev/docs/getting-started/installation/) for more information about this). + +2. Update your values.yaml file to enable metrics for the operator, collector and stream processor components. + + - For the collector, update the `collector` section, under the `dataPlane` section: + ```yaml + dataPlane: + collector: + # Enable service monitor + serviceMonitor: + enabled: true + + # Make sure to label the ServiceMonitor so that Prometheus can discover it + labels: + release: prometheus + ``` -The endpoint for the collector metrics is `https:///metrics/collector-source` + - For the stream processor, update the `rdiMetricsExporter` section: + ```yaml + rdiMetricsExporter: + # Enable service monitor + serviceMonitor: + enabled: true + + # Make sure to label the ServiceMonitor so that Prometheus can discover it + labels: + release: prometheus + ``` + + - For the operator, update the `operator` section: + ```yaml + operator: + prometheus: + enabled: true + labels: + release: prometheus + metrics: + enabled: true + ``` + +{{< note >}}The Prometheus service discovery loop runs at regular intervals. This means that after deploying or updating RDI with the above configuration, it may take a few minutes for Prometheus to discover the new ServiceMonitors and start scraping metrics from the RDI components. +{{< /note >}} + +## Collector metrics These metrics are divided into three groups: @@ -89,10 +151,8 @@ The following table lists all collector metrics and their descriptions: {{< note >}} Many metrics include context labels that specify the phase (`snapshot` or `streaming`), database name, and other contextual information. Metrics with a value of `-1` typically indicate that the measurement is not applicable in the current state. {{< /note >}} - -## Stream processor metrics -The endpoint for the stream processor metrics is `https:///metrics/rdi` +## Stream processor metrics RDI reports metrics during the two main phases of the ingest pipeline, the *snapshot* phase and the *change data capture (CDC)* phase. (See the @@ -145,6 +205,16 @@ RDI reports with their descriptions. - **Last batch metrics**: Show real-time performance data for the most recently processed batch {{< /note >}} +## Operator metrics +Most of the metrics exposed by the RDI operator are standard controller-runtime [metrics](https://book.kubebuilder.io/reference/metrics-reference). +The metrics that are relevant for RDI operations are listed in the table below: + +| Metric Name | Metric Type | Metric Description | Alerting Recommendations | +|-------------|-------------|--------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------| +| `rdi_operator_is_leader` | Gauge | Current leadership status (1 = leader, 0 = not leader) | Informational - may be used for alerting there is no leader for prolonged period of time. | +| `rdi_operator_pipeline_phase` | Gauge | Current Pipeline phase | Informational - may be used for alerting if operator is stuck in the `Resetting` phase for a prolonged period of time | + + ## Recommended alerting strategy The alerting strategy described in the sections below focuses on system failures and data integrity issues that require immediate attention. Most other metrics are informational, so you should monitor them for trends rather than trigger alerts.