66 changes: 54 additions & 12 deletions helm-chart/dash0-operator/README.md
Original file line number Diff line number Diff line change
@@ -707,8 +707,6 @@ By default, the operator collects metrics as follows:
in the Dash0 operator configuration resource (or setting the value
`operator.kubernetesInfrastructureMetricsCollectionEnabled` to `false` when deploying the operator configuration
resource via the Helm chart).
(Collecting node metrics via the host metrics receiver is not supported in
[GKE Autopilot clusters](#notes-on-gke-autopilot), the host metric receiver will be disabled there.)
* Namespace-scoped metrics (e.g. metrics related to a workload running in a specific namespace) will only be collected
if the namespace is monitored, that is, there is a Dash0 monitoring resource in that namespace.
* The Dash0 operator scrapes Prometheus endpoints on pods annotated with the `prometheus.io/*` annotations in monitored
@@ -982,7 +980,6 @@ similar to take effect.
All the pods deployed by the operator have a default node anti-affinity for the `dash0.com/enable=false` node label.
That is, if you add the `dash0.com/enable=false` label to a node, none of the pods owned by the operator will be
scheduled on that node.
(This features is not available on [GKE Autopilot clusters](#notes-on-gke-autopilot).)

**IMPORTANT:** This includes the daemonset that the operator will set up to receive telemetry from the pods, which might
lead to situations in which instrumented pods cannot send telemetry because the local node does not have a daemonset
@@ -2001,23 +1998,68 @@ operator:

GKE Autopilot [restricts](https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-security) what workloads
in an Autopilot cluster can do.
By setting `operator.gke.autopilot.enabled` to true, the Dash0 operator Helm chart will adjust its own configuration
to comply with these restrictions.
In particular, this will:
- omit the `dash0.com/enable` node affinity rule (custom node affinities are not allowed in GKE Autopilot)
- disable the host metrics receiver, as it requires mounting the full host file system as a volume, which is not
permitted on GKE autopilot
- disable collecting all four utilization metrics for the `kubeletstats` receiver metrics; collecting these requires access to the
With `operator.gke.autopilot.enabled` set to `true`, the Dash0 operator Helm chart deploys an
`auto.gke.io/AllowlistSynchronizer` resource into the target cluster, which in turn will add the required
`auto.gke.io/WorkloadAllowlist` resources for Dash0 workloads (the operator and the OpenTelemetry collectors it
manages).
This allows the Dash0 operator to work on GKE Autopilot clusters.
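
As a minimal sketch, the Helm values fragment for this setup might look as follows. It uses only the two settings named in this README; `deployAllowListSynchronizer` is shown with its documented default of `true` and could be omitted.

```yaml
operator:
  gke:
    autopilot:
      # Switch the chart to GKE Autopilot mode.
      enabled: true
      # Documented default; the chart deploys the AllowlistSynchronizer itself.
      deployAllowListSynchronizer: true
```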

Not all restrictions can be lifted via workload allowlists. The following features are not available on GKE Autopilot
clusters:
- collecting utilization metrics with the `kubeletstats` receiver is disabled; collecting these requires access to the
`/pod` endpoint of the kubelet API which is not available in GKE autopilot due to the lack of the `nodes/proxy`
permission:
- `k8s.pod.cpu_limit_utilization`,
- `k8s.pod.cpu_request_utilization`,
- `k8s.pod.memory_limit_utilization`, and
- `k8s.pod.memory_request_utilization`
- disable collecting the extra metadata labels `container.id` and `k8s.volume.type` for the `kubeletstats` receiver
metrics, collecting these requires access to the `/pod` endpoint of the kubelet API which is not available in GKE
- collecting the extra metadata labels `container.id` and `k8s.volume.type` for the `kubeletstats` receiver metrics is
disabled; collecting these requires access to the `/pod` endpoint of the kubelet API which is not available in GKE
autopilot due to the lack of the `nodes/proxy` permission

Please note that the `AllowlistSynchronizer` resource is not removed automatically with `helm uninstall dash0-operator`.
If you decide to later remove the Dash0 operator Helm release from the cluster, you might want to delete the
`AllowlistSynchronizer` manually afterward.
(Deleting the `AllowlistSynchronizer` will also delete all associated `WorkloadAllowlist` resources.)

### Managing the AllowlistSynchronizer Manually

As an alternative to letting the Helm chart install the `AllowlistSynchronizer`, you can manage it manually:

```yaml
operator:
  gke:
    autopilot:
      enabled: true
      deployAllowListSynchronizer: false
```

With these settings, the Dash0 operator Helm chart will not deploy the `AllowlistSynchronizer`.
Using these settings requires that you deploy the Dash0 `AllowlistSynchronizer` before installing the Dash0 operator.
To do that, create the following file `dash0-gke-autopilot-allowlist-synchronizer.yaml`:
```yaml
apiVersion: auto.gke.io/v1
kind: AllowlistSynchronizer
metadata:
  name: dash0-allowlist-synchronizer
spec:
  allowlistPaths:
    - Dash0/operator-manager/dash0-operator-manager-v1.0.0.yaml
    - Dash0/post-install/dash0-post-install-v1.0.0.yaml
    - Dash0/pre-delete/dash0-pre-delete-v1.0.0.yaml
    - Dash0/opentelemetry-collector-agent/dash0-opentelemetry-collector-agent-v1.0.0.yaml
    - Dash0/opentelemetry-cluster-metrics-collector/dash0-opentelemetry-cluster-metrics-collector-v1.0.0.yaml
```

Then deploy it as follows:
```
kubectl apply -f dash0-gke-autopilot-allowlist-synchronizer.yaml
```

When managing the `AllowlistSynchronizer` manually, you might need to update it from time to time for future Dash0
operator releases.

## Notes on Azure AKS

In [AKS](https://azure.microsoft.com/products/kubernetes-service) clusters that have the
Original file line number Diff line number Diff line change
@@ -369,6 +369,14 @@ rules:
- get
- list
- watch

# Permissions required for cleaning up the Dash0 AllowlistSynchronizer in GKE Autopilot clusters.
- apiGroups:
- auto.gke.io
resources:
- allowlistsynchronizers
verbs:
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
Original file line number Diff line number Diff line change
@@ -84,17 +84,18 @@ spec:
{{- if .Values.operator.podLabels }}
{{- include "dash0-operator.podLabels" . | nindent 8 }}
{{- end }}
{{- if .Values.operator.gke.autopilot.enabled }}
cloud.google.com/matching-allowlist: dash0-operator-manager-v1.0.0
{{- end }}
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
{{- if not .Values.operator.gke.autopilot.enabled }}
- key: "dash0.com/enable"
operator: "NotIn"
values: ["false"]
{{- end }}
- key: "kubernetes.io/os"
operator: "In"
values: ["linux"]
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
{{- if (and .Values.operator.gke.autopilot.enabled .Values.operator.gke.autopilot.deployAllowListSynchronizer) }}
apiVersion: auto.gke.io/v1
kind: AllowlistSynchronizer
metadata:
  name: dash0-allowlist-synchronizer
  annotations:
    {{/*
    We need to make sure that the AllowlistSynchronizer has been deployed (and is ready, i.e. has fetched the
    referenced WorkloadAllowlist resources) before Helm tries to deploy the rest of our resources, in particular the
    operator manager deployment. Unfortunately, Helm has no good out-of-the box support for defining dependencies like
    this in a chart, but annotating a resource as a pre-install hook gives us the desired order.
    The downside is that hook resources are not considered to be part of the release, so they are not removed with
    helm uninstall. We take care of that in a delete hook job.
    */}}
    helm.sh/hook: "pre-install,pre-upgrade"
    helm.sh/hook-weight: "-1"
spec:
  allowlistPaths:
    - Dash0/operator-manager/dash0-operator-manager-v1.0.0.yaml
    - Dash0/post-install/dash0-post-install-v1.0.0.yaml
    - Dash0/pre-delete/dash0-pre-delete-v1.0.0.yaml
    - Dash0/opentelemetry-collector-agent/dash0-opentelemetry-collector-agent-v1.0.0.yaml
    - Dash0/opentelemetry-cluster-metrics-collector/dash0-opentelemetry-cluster-metrics-collector-v1.0.0.yaml
{{- end }}
Original file line number Diff line number Diff line change
@@ -13,8 +13,8 @@ metadata:
{{- include "dash0-operator.labels" . | nindent 4 }}
dash0.com/enable: "false"
annotations:
"helm.sh/hook": post-install
"helm.sh/hook-delete-policy": hook-succeeded
helm.sh/hook: post-install
helm.sh/hook-delete-policy: hook-succeeded
spec:
template:
metadata:
@@ -25,17 +25,18 @@ spec:
app.kubernetes.io/instance: post-install-hook
app.kubernetes.io/managed-by: {{ .Release.Service | quote }}
helm.sh/chart: {{ include "dash0-operator.chartNameWithVersion" . }}
{{- if .Values.operator.gke.autopilot.enabled }}
cloud.google.com/matching-allowlist: dash0-post-install-v1.0.0
{{- end }}
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
{{- if not .Values.operator.gke.autopilot.enabled }}
- key: "dash0.com/enable"
operator: "NotIn"
values: ["false"]
{{- end }}
- key: "kubernetes.io/os"
operator: "In"
values: ["linux"]
@@ -53,6 +54,7 @@ spec:
imagePullPolicy: {{ .Values.operator.image.pullPolicy }}
command:
- /manager
args:
- "--auto-operator-configuration-resource-available-check"
{{ include "dash0-operator.restrictiveContainerSecurityContext" dict | nindent 10 }}
resources:
Original file line number Diff line number Diff line change
@@ -9,8 +9,8 @@ metadata:
{{- include "dash0-operator.labels" . | nindent 4 }}
dash0.com/enable: "false"
annotations:
"helm.sh/hook": pre-delete
"helm.sh/hook-delete-policy": hook-succeeded
helm.sh/hook: pre-delete
helm.sh/hook-delete-policy: hook-succeeded
spec:
template:
metadata:
@@ -21,18 +21,19 @@ spec:
app.kubernetes.io/instance: pre-delete-hook
app.kubernetes.io/managed-by: {{ .Release.Service | quote }}
helm.sh/chart: {{ include "dash0-operator.chartNameWithVersion" . }}
{{- if .Values.operator.gke.autopilot.enabled }}
cloud.google.com/matching-allowlist: dash0-pre-delete-v1.0.0
{{- end }}
spec:
restartPolicy: OnFailure
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
{{- if not .Values.operator.gke.autopilot.enabled }}
- key: "dash0.com/enable"
operator: "NotIn"
values: ["false"]
{{- end }}
- key: "kubernetes.io/os"
operator: "In"
values: ["linux"]
@@ -49,6 +50,7 @@ spec:
imagePullPolicy: {{ .Values.operator.image.pullPolicy }}
command:
- /manager
args:
- "--uninstrument-all"
{{ include "dash0-operator.restrictiveContainerSecurityContext" dict | nindent 10 }}
resources:
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
should install the GKE Autopilot AllowlistSynchronizer if operator.gke.autopilot.enabled is true:
  1: |
    apiVersion: auto.gke.io/v1
    kind: AllowlistSynchronizer
    metadata:
      annotations:
        helm.sh/hook: pre-install,pre-upgrade
        helm.sh/hook-weight: "-1"
      name: dash0-allowlist-synchronizer
    spec:
      allowlistPaths:
        - Dash0/operator-manager/dash0-operator-manager-v1.0.0.yaml
        - Dash0/post-install/dash0-post-install-v1.0.0.yaml
        - Dash0/pre-delete/dash0-pre-delete-v1.0.0.yaml
        - Dash0/opentelemetry-collector-agent/dash0-opentelemetry-collector-agent-v1.0.0.yaml
        - Dash0/opentelemetry-cluster-metrics-collector/dash0-opentelemetry-cluster-metrics-collector-v1.0.0.yaml
Original file line number Diff line number Diff line change
@@ -43,9 +43,10 @@ post-install hook job should match snapshot:
- linux
automountServiceAccountToken: true
containers:
- command:
- /manager
- args:
- --auto-operator-configuration-resource-available-check
command:
- /manager
image: ghcr.io/dash0hq/operator-controller:0.0.0
imagePullPolicy: null
name: post-install-job
Original file line number Diff line number Diff line change
@@ -43,9 +43,10 @@ pre-delete hook job should match snapshot:
- linux
automountServiceAccountToken: true
containers:
- command:
- /manager
- args:
- --uninstrument-all
command:
- /manager
image: ghcr.io/dash0hq/operator-controller:0.0.0
imagePullPolicy: null
name: pre-delete-job
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
suite: test allowlist synchronizer
templates:
  - operator/gke-autopilot-allowlist-synchronizer.yaml
tests:
  - it: should not install the GKE Autopilot AllowlistSynchronizer if operator.gke.autopilot.enabled is false
    asserts:
      - hasDocuments:
          count: 0

  - it: should not install the GKE Autopilot AllowlistSynchronizer if operator.gke.autopilot.enabled is true but operator.gke.autopilot.deployAllowListSynchronizer is false
    set:
      operator:
        gke:
          autopilot:
            enabled: true
            deployAllowListSynchronizer: false
    asserts:
      - hasDocuments:
          count: 0

  - it: should install the GKE Autopilot AllowlistSynchronizer if operator.gke.autopilot.enabled is true
    set:
      operator:
        gke:
          autopilot:
            enabled: true
    asserts:
      - hasDocuments:
          count: 1
      - matchSnapshot: {}
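
The suite above relies on a snapshot to cover the hook annotations that enforce the install order. A possible additional test, sketched here under the assumption that the helm-unittest `equal` assertion with a bracketed `path` is available in the plugin version in use, could check those annotations explicitly. The suite name and test title are illustrative and not part of this change.

```yaml
suite: test allowlist synchronizer hook annotations (illustrative)
templates:
  - operator/gke-autopilot-allowlist-synchronizer.yaml
tests:
  - it: should annotate the AllowlistSynchronizer as a pre-install/pre-upgrade hook
    set:
      operator:
        gke:
          autopilot:
            enabled: true
    asserts:
      # The hook annotation is what makes Helm create this resource before the rest of the chart.
      - equal:
          path: metadata.annotations["helm.sh/hook"]
          value: pre-install,pre-upgrade
      # The weight is a string in the template, so it is compared as a string here.
      - equal:
          path: metadata.annotations["helm.sh/hook-weight"]
          value: "-1"
```

An explicit assertion like this fails with a focused message if the annotations are changed, whereas a snapshot mismatch has to be inspected by hand.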
13 changes: 10 additions & 3 deletions helm-chart/dash0-operator/values.yaml
Original file line number Diff line number Diff line change
@@ -104,9 +104,7 @@ operator:
autopilot:
# Set operator.gke.autopilot.enabled=true if you are running the Dash0 operator in a GKE Autopilot cluster.
# This will:
# - omit the dash0.com/enable node affinity rule (custom node affinities are not allowed in GKE Autopilot)
# - disable the host metrics receiver, as it requires mounting the full host file system as a volume, which is not
# permitted on GKE autopilot
# - deploy an AllowlistSynchronizer into the cluster
# - disable collecting all four utilization metrics for the kubeletstats receiver metrics, collecting these
# requires access to the /pod endpoint of the kubelet API which is not available in GKE autopilot:
# - k8s.pod.cpu_limit_utilization
@@ -120,6 +118,15 @@ operator:
# add these namespaces anyway
enabled: false

# Let the Dash0 operator Helm chart automatically deploy the AllowlistSynchronizer resource to your cluster, if
# operator.gke.autopilot.enabled is true. This setting has no effect if operator.gke.autopilot.enabled is false.
# If set to false, the assumption is that you have deployed the Dash0 AllowlistSynchronizer resource yourself into
# your GKE Autopilot cluster.
# The default is true; there is usually no reason to override this.
# See https://github.com/dash0hq/dash0-operator/blob/main/helm-chart/dash0-operator/README.md#notes-on-gke-autopilot
# for more information.
deployAllowListSynchronizer: true

# An array of tolerations for the operator manager deployment. This can be used to make sure that the operator manager
# pod(s) can be scheduled on nodes where they would not be scheduled otherwise due to Kubernetes taints.
# Example: