@the-mann

Description

Adding the following EFA metrics:

  • unresponsive_remote_events
  • impaired_remote_conn_events
  • retrans_timeout_events
  • retrans_pkts
  • retrans_bytes

at the node, container, and pod levels.

Testing

Manual testing:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cwa-efa-cluster
  region: us-west-2
  version: "1.33"

iam:
  withOIDC: true

availabilityZones: ["us-west-2a", "us-west-2c"]

managedNodeGroups:
  - name: my-efa-ng-2
    instanceType: c6in.32xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    availabilityZones: ["us-west-2c"]
    volumeSize: 300
    privateNetworking: true
    efaEnabled: true

Save that to cluster.yaml and run:

eksctl create cluster -f cluster.yaml

Install the CloudWatch Observability EKS add-on:

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html

Then, deploy the following:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-training-job-2
  namespace: default
  labels:
    app: my-training-job-2
spec:
  selector:
    matchLabels:
      app: my-training-job-2
  template:
    metadata:
      labels:
        app: my-training-job-2
    spec:
      containers:
        - name: efa-device-holder
          image: busybox:latest
          command: ["/bin/sh", "-c", "sleep infinity"]
          resources:
            limits:
              vpc.amazonaws.com/efa: 1  # Request EFA device
            requests:
              vpc.amazonaws.com/efa: 1

Finally, build a dev version of the CloudWatch agent from this branch plus aws/amazon-cloudwatch-agent@3cc6b58, and check that the metrics exist:

[image]

- unresponsive_remote_events
- impaired_remote_conn_events
- retrans_timeout_events
- retrans_pkts
- retrans_bytes
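
As a complement to checking the CloudWatch console, the raw counters can also be inspected on the node itself. The following is a minimal sketch, not the agent's scraper: it assumes the EFA driver exposes each counter as a text file under /sys/class/infiniband/<device>/ports/<port>/hw_counters/, and the helper names (counterPath, parseCounter, readEfaCounters) and the device name used in the usage note are hypothetical.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// newEfaCounters lists the counter names added in this PR.
var newEfaCounters = []string{
	"unresponsive_remote_events",
	"impaired_remote_conn_events",
	"retrans_timeout_events",
	"retrans_pkts",
	"retrans_bytes",
}

// counterPath builds the sysfs path for one counter.
// The directory layout is an assumption, not taken from this PR.
func counterPath(root, device, port, name string) string {
	return filepath.Join(root, device, "ports", port, "hw_counters", name)
}

// parseCounter converts the file contents (a decimal number, usually
// followed by a newline) into an unsigned integer.
func parseCounter(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readEfaCounters reads the named counters for one device port.
// Counters that do not exist on disk are skipped rather than treated
// as errors, since older drivers may not expose the newer ones.
func readEfaCounters(root, device, port string, names []string) (map[string]uint64, error) {
	out := make(map[string]uint64, len(names))
	for _, name := range names {
		raw, err := os.ReadFile(counterPath(root, device, port, name))
		if os.IsNotExist(err) {
			continue
		}
		if err != nil {
			return nil, err
		}
		v, err := parseCounter(string(raw))
		if err != nil {
			return nil, err
		}
		out[name] = v
	}
	return out, nil
}
```

On a node, this could be called as readEfaCounters("/sys/class/infiniband", "rdmap0s6", "1", newEfaCounters), where the device name is only an example and has to be replaced with whatever the instance actually reports.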
@the-mann the-mann requested a review from mxiamxia as a code owner November 20, 2025 00:16
movence
movence previously approved these changes Nov 20, 2025
okankoAMZ
okankoAMZ previously approved these changes Nov 20, 2025
… counters and so they are converted to deltas before sending down the pipeline.
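
The counter-to-delta conversion mentioned in this commit message can be illustrated with a short sketch. It is not the agent's implementation: deltaTracker and its methods are hypothetical names, and dropping the first sample and any sample after a counter reset is an assumption about reasonable behavior rather than something stated in this PR.

```go
package main

// deltaTracker remembers the last cumulative value seen for each key
// (for example "device/port/retrans_pkts").
type deltaTracker struct {
	prev map[string]uint64
}

func newDeltaTracker() *deltaTracker {
	return &deltaTracker{prev: make(map[string]uint64)}
}

// delta returns the increase of a cumulative counter since the previous
// sample for the same key. ok is false when there is no baseline yet or
// when the counter went backwards, which is treated as a reset; in both
// cases the current value becomes the new baseline.
func (t *deltaTracker) delta(key string, current uint64) (d uint64, ok bool) {
	last, seen := t.prev[key]
	t.prev[key] = current
	if !seen || current < last {
		return 0, false
	}
	return current - last, true
}
```

Emitting nothing for the first sample avoids reporting the whole lifetime total of the counter as a single spike when the agent starts.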
@the-mann the-mann dismissed stale reviews from okankoAMZ and movence via 063b939 November 21, 2025 17:31
@the-mann the-mann requested a review from duhminick November 24, 2025 19:26
github-actions bot commented Dec 9, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 9, 2025
EfaRetransBytes = "retrans_bytes"
EfaRetransPkts = "retrans_pkts"
EfaRetransTimeoutEvents = "retrans_timeout_events"
EfaUnresponsiveRemoveEvents = "unresponsive_remote_events"

typo in const name
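
The comment most likely refers to "Remove" appearing where "Remote" was intended in the last identifier. A possible correction is sketched below; the renamed identifier is a suggestion and an assumption about the intended spelling, not the merged code, and the package clause is a placeholder so that the snippet compiles on its own.

```go
package main

// Metric-name constants as shown in the snippet under review, with the
// last identifier renamed from EfaUnresponsiveRemoveEvents (suggested
// spelling, not confirmed by the PR). The string values are unchanged.
const (
	EfaRetransBytes             = "retrans_bytes"
	EfaRetransPkts              = "retrans_pkts"
	EfaRetransTimeoutEvents     = "retrans_timeout_events"
	EfaUnresponsiveRemoteEvents = "unresponsive_remote_events"
)
```

Because only the Go identifier changes, the emitted metric name stays the same, so a rename like this would not affect dashboards or alarms.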

@the-mann the-mann merged commit 693698d into aws-cwa-dev Dec 10, 2025
274 of 276 checks passed
@the-mann the-mann deleted the mpmann/more-efa-metrics branch December 10, 2025 16:09
@the-mann the-mann mentioned this pull request Dec 10, 2025

4 participants