feat: add more metrics in planner #4710

tedzhouhk · 2025-12-03T01:42:27Z

expose more metrics in planner

Summary by CodeRabbit

Refactor
- Reorganized internal metrics collection infrastructure for improved code structure and maintainability. Metrics behavior remains unchanged when Prometheus monitoring is enabled or disabled.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

coderabbitai · 2025-12-03T01:45:57Z

Walkthrough

Consolidates Prometheus metrics into a dedicated PlannerPrometheusMetrics class, replacing scattered Gauge instances across the Planner. Refactors metric updates in observe_metrics and make_adjustments methods to access gauges through the new container, maintaining metric gating behavior when Prometheus is disabled.

Changes

Cohort / File(s)	Change Summary
Prometheus Metrics Consolidation `components/src/dynamo/planner/utils/planner_core.py`	Introduces new `PlannerPrometheusMetrics` class to encapsulate all worker count, observed, correction factor, predicted load, and predicted replica gauges. Refactors `Planner` to replace individual Gauge declarations with single `self.prometheus_metrics` attribute. Rewires metric updates in `observe_metrics` and `make_adjustments` methods to access gauges via the container.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Verify all metric update paths have been correctly rewired from direct Gauge access to container access across observe_metrics and make_adjustments methods
Ensure PlannerPrometheusMetrics class initialization properly instantiates all required gauge instances
Confirm metric gating logic (prometheus_port != 0) is preserved throughout the refactored methods

Poem

🐰 Gauges once scattered, now neatly arranged,
In a metrics container, thoughtfully changed!
From direct to delegated, a cleaner design,
The Planner observes with metrics so fine. ✨

Pre-merge checks

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description 'expose more metrics in planner' is vague and incomplete, missing all required template sections: Overview, Details, Where should the reviewer start, and Related Issues.	Expand the description to follow the template structure with Overview, Details, reviewer guidance, and Related Issues sections.
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: add more metrics in planner' is partially related to the changeset—it mentions metrics and the planner, but fails to capture the main architectural change: consolidating metrics into a dedicated PlannerPrometheusMetrics container.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 135bce4 and f47e53f.

📒 Files selected for processing (1)

components/src/dynamo/planner/utils/planner_core.py (5 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-09-16T00:26:43.641Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/examples/system_metrics/README.md:65-65
Timestamp: 2025-09-16T00:26:43.641Z
Learning: The team at ai-dynamo/dynamo prefers to use consistent metric naming patterns with _total suffixes across all metric types (including gauges) for internal consistency, even when this differs from strict Prometheus conventions that reserve _total for counters only. This design decision was confirmed by keivenchang in PR 3035, referencing examples in prometheus_names.rs and input from team members.

Applied to files:

components/src/dynamo/planner/utils/planner_core.py

📚 Learning: 2025-09-16T00:27:43.992Z

Learnt from: keivenchang
Repo: ai-dynamo/dynamo PR: 3035
File: lib/runtime/src/pipeline/network/ingress/push_handler.rs:75-79
Timestamp: 2025-09-16T00:27:43.992Z
Learning: In the ai-dynamo/dynamo codebase, the project uses "_total" suffix for all Prometheus metrics including gauges like inflight_requests, which differs from standard Prometheus conventions. The constant work_handler::INFLIGHT_REQUESTS does not exist - only work_handler::INFLIGHT_REQUESTS_TOTAL exists and should be used for the inflight requests gauge metric.

Applied to files:

components/src/dynamo/planner/utils/planner_core.py

🧬 Code graph analysis (1)

components/src/dynamo/planner/utils/planner_core.py (1)

lib/bindings/python/src/dynamo/prometheus_metrics.pyi (1)

Gauge (54-85)

🪛 Ruff (0.14.7)

components/src/dynamo/planner/utils/planner_core.py

519-519: Do not catch blind exception: Exception

(BLE001)

520-520: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: sglang (arm64)
GitHub Check: trtllm (arm64)
GitHub Check: operator (arm64)
GitHub Check: vllm (arm64)
GitHub Check: operator (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (4)

components/src/dynamo/planner/utils/planner_core.py (4)

61-113: Clean consolidation of Prometheus metrics.

This is a well-structured container class that groups related metrics logically and improves maintainability over scattered Gauge declarations. The metric naming is consistent and the help strings are descriptive.

345-357: LGTM - Consistent with existing metric handling patterns.

The observed metrics are set after fetching from the Prometheus API, following the same assumptions as the existing logging statements at lines 338-342. The rate calculation at line 350 correctly converts request count to requests per second.

510-518: LGTM - Correction factor metrics properly gated and placed within exception handling.

526-542: LGTM - Predicted metrics properly guarded.

The predicted load metrics are only set after validating the values are not None (line 525), and the predicted replica metrics are set within the try block after successful computation. Both sections are properly gated behind the prometheus_port != 0 check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

feat: add more metric in planner

f47e53f

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

tedzhouhk requested review from a team as code owners December 3, 2025 01:42

pull-request-size bot added the size/L label Dec 3, 2025

github-actions bot added the feat label Dec 3, 2025

use prefix

7e8fecc

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:47 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:49 Inactive

add gpu hour

de2e46c

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 01:59 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 02:00 Inactive

PeaBrane approved these changes Dec 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add more metrics in planner #4710

feat: add more metrics in planner #4710

Uh oh!

tedzhouhk commented Dec 3, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: add more metrics in planner #4710

Are you sure you want to change the base?

feat: add more metrics in planner #4710

Uh oh!

Conversation

tedzhouhk commented Dec 3, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 3, 2025

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tedzhouhk commented Dec 3, 2025 •

edited by coderabbitai bot

Loading