Skip to content

Conversation

@tedzhouhk
Copy link
Contributor

@tedzhouhk tedzhouhk commented Dec 3, 2025

expose more metrics in planner

Summary by CodeRabbit

  • Refactor
    • Reorganized internal metrics collection infrastructure for improved code structure and maintainability. Metrics behavior remains unchanged when Prometheus monitoring is enabled or disabled.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk requested review from a team as code owners December 3, 2025 01:42
@github-actions github-actions bot added the feat label Dec 3, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 3, 2025

Walkthrough

Consolidates Prometheus metrics into a dedicated PlannerPrometheusMetrics class, replacing scattered Gauge instances across the Planner. Refactors metric updates in observe_metrics and make_adjustments methods to access gauges through the new container, maintaining metric gating behavior when Prometheus is disabled.

Changes

Cohort / File(s) Change Summary
Prometheus Metrics Consolidation
components/src/dynamo/planner/utils/planner_core.py
Introduces new PlannerPrometheusMetrics class to encapsulate all worker count, observed, correction factor, predicted load, and predicted replica gauges. Refactors Planner to replace individual Gauge declarations with single self.prometheus_metrics attribute. Rewires metric updates in observe_metrics and make_adjustments methods to access gauges via the container.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Verify all metric update paths have been correctly rewired from direct Gauge access to container access across observe_metrics and make_adjustments methods
  • Ensure PlannerPrometheusMetrics class initialization properly instantiates all required gauge instances
  • Confirm metric gating logic (prometheus_port != 0) is preserved throughout the refactored methods

Poem

🐰 Gauges once scattered, now neatly arranged,
In a metrics container, thoughtfully changed!
From direct to delegated, a cleaner design,
The Planner observes with metrics so fine. ✨

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The description 'expose more metrics in planner' is vague and incomplete, missing all required template sections: Overview, Details, Where should the reviewer start, and Related Issues. Expand the description to follow the template structure with Overview, Details, reviewer guidance, and Related Issues sections.
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: add more metrics in planner' is partially related to the changeset—it mentions metrics and the planner, but fails to capture the main architectural change: consolidating metrics into a dedicated PlannerPrometheusMetrics container.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 135bce4 and f47e53f.

📒 Files selected for processing (1)
  • components/src/dynamo/planner/utils/planner_core.py (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-16T00:26:43.641Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

  • components/src/dynamo/planner/utils/planner_core.py
📚 Learning: 2025-09-16T00:27:43.992Z
Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Applied to files:

  • components/src/dynamo/planner/utils/planner_core.py
🧬 Code graph analysis (1)
components/src/dynamo/planner/utils/planner_core.py (1)
lib/bindings/python/src/dynamo/prometheus_metrics.pyi (1)
  • Gauge (54-85)
🪛 Ruff (0.14.7)
components/src/dynamo/planner/utils/planner_core.py

519-519: Do not catch blind exception: Exception

(BLE001)


520-520: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: sglang (arm64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
components/src/dynamo/planner/utils/planner_core.py (4)

61-113: Clean consolidation of Prometheus metrics.

This is a well-structured container class that groups related metrics logically and improves maintainability over scattered Gauge declarations. The metric naming is consistent and the help strings are descriptive.


345-357: LGTM - Consistent with existing metric handling patterns.

The observed metrics are set after fetching from the Prometheus API, following the same assumptions as the existing logging statements at lines 338-342. The rate calculation at line 350 correctly converts request count to requests per second.


510-518: LGTM - Correction factor metrics properly gated and placed within exception handling.


526-542: LGTM - Predicted metrics properly guarded.

The predicted load metrics are only set after validating the values are not None (line 525), and the predicted replica metrics are set within the try block after successful computation. Both sections are properly gated behind the prometheus_port != 0 check.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants