Skip to content

Conversation

@atchernych
Copy link
Contributor

@atchernych atchernych commented Dec 2, 2025

Overview:

Start ing Discovery daemon lazily would allow not to run the discovery daemon for prefill worker.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Refactor
    • Updated Kubernetes discovery daemon initialization to use a lazy, on-demand approach instead of eager startup. The daemon now starts only when needed, improving application startup performance.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
@atchernych atchernych requested a review from a team as a code owner December 2, 2025 20:43
@atchernych atchernych marked this pull request as draft December 2, 2025 20:43
@github-actions github-actions bot added the feat label Dec 2, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 2, 2025

Walkthrough

Refactored Kubernetes discovery daemon from eager startup to lazy, on-demand initialization. Restructured KubeDiscoveryClient with additional fields, replaced derived Clone with manual implementation, introduced DaemonHandles type, and added a metadata_watch() method that spawns the daemon only when first invoked.

Changes

Cohort / File(s) Summary
Kubernetes Discovery Daemon Initialization
lib/runtime/src/discovery/kube.rs
Transitioned daemon startup from eager (during initialization) to lazy (on first metadata watch request). Expanded KubeDiscoveryClient with kube_client, pod_info, cancel_token, and daemon_state fields. Removed #[derive(Clone)] and implemented explicit Clone trait. Introduced internal DaemonHandles type to encapsulate watch mechanism. Added async metadata_watch() method that initializes and manages daemon lifecycle via OnceCell. Updated list and list_and_watch methods to lazily invoke metadata_watch() instead of using pre-bound watch channel.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

  • Thread-safety considerations: Verify OnceCell and Arc usage patterns correctly handle concurrent access and cloning across async boundaries
  • Error handling in lazy initialization: Ensure metadata_watch() error cases are properly propagated and don't cause silent daemon startup failures
  • Clone semantics: Confirm manual Clone implementation correctly duplicates all fields, especially Arc and CancellationToken
  • Caller compatibility: Review impact on existing callers of list/list_and_watch to ensure they handle lazy initialization behavior

Poem

🐰 A daemon sleeps until called upon,
No eager spawning at the break of dawn,
OnceCell keeps watch with gentle care,
Lazy initialization everywhere!
Hop, clone, and share with threads so fleet,

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description check ⚠️ Warning The PR description follows the required template structure but lacks substantive content in critical sections (Details and Where should the reviewer start), with only placeholder text in Related Issues. Complete the Details section describing implementation changes, add guidance on where reviewers should focus, and replace the placeholder issue reference with the actual GitHub issue number.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: Start Discovery daemon lazily' directly and accurately describes the main change: converting eager daemon startup to lazy on-demand initialization.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
lib/runtime/src/discovery/kube.rs (2)

74-85: Manual Clone implementation is correct and ensures shared daemon state.

All clones of KubeDiscoveryClient will share the same daemon_state, meaning the daemon is only started once regardless of how many clones exist. Consider adding a brief doc comment to make this sharing behavior explicit for future maintainers.


340-362: Lazy initialization via get_or_try_init is correctly implemented.

The pattern ensures thread-safe one-time initialization, properly propagates errors from DiscoveryDaemon::new(), and logs the startup event.

One observation: the spawned daemon's JoinHandle is dropped (fire-and-forget pattern). While the daemon uses cancel_token for shutdown signaling, storing the handle in DaemonHandles could enable graceful shutdown waiting or panic detection in the future.

 #[derive(Debug)]
 struct DaemonHandles {
     metadata_watch: tokio::sync::watch::Receiver<Arc<MetadataSnapshot>>,
+    #[allow(dead_code)] // Retained for future graceful shutdown
+    daemon_handle: tokio::task::JoinHandle<()>,
 }
-                tokio::spawn(async move {
+                let daemon_handle = tokio::spawn(async move {
                     if let Err(e) = daemon.run(watch_tx).await {
                         tracing::error!("Discovery daemon failed: {}", e);
                     }
                 });
                 // ... 
                 Ok::<Arc<DaemonHandles>, anyhow::Error>(Arc::new(DaemonHandles {
                     metadata_watch: watch_rx,
+                    daemon_handle,
                 }))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03070fd and 2c80b06.

📒 Files selected for processing (1)
  • lib/runtime/src/discovery/kube.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.

Applied to files:

  • lib/runtime/src/discovery/kube.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, when using `kv_get_and_watch_prefix()` and `dissolve()`, the returned `events_rx` receiver maintains the etcd watch stream independently. The watcher handle can be safely dropped (using `_watcher`) without terminating the stream, as the receiver keeps the connection alive internally. This is a consistent pattern used throughout the codebase in multiple critical modules.

Applied to files:

  • lib/runtime/src/discovery/kube.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, when using `kv_get_and_watch_prefix()` and `dissolve()`, the returned `events_rx` receiver maintains the etcd watch stream independently. The watcher handle can be safely dropped (using `_watcher`) without terminating the stream, as the receiver keeps the connection alive internally.

Applied to files:

  • lib/runtime/src/discovery/kube.rs
🧬 Code graph analysis (1)
lib/runtime/src/discovery/kube.rs (2)
lib/runtime/src/distributed.rs (1)
  • new (101-297)
lib/runtime/src/discovery/kube/daemon.rs (1)
  • new (38-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: tests (.)
🔇 Additional comments (5)
lib/runtime/src/discovery/kube.rs (5)

25-37: Struct redesign for lazy initialization looks correct.

The Arc<OnceCell<Arc<DaemonHandles>>> pattern is appropriate: the outer Arc allows sharing the OnceCell across clones of KubeDiscoveryClient, while OnceCell::get_or_try_init ensures thread-safe one-time initialization.


63-71: Constructor correctly defers daemon startup.

The initialization now stores the necessary components (kube_client, pod_info, cancel_token) without starting the daemon, enabling lazy startup on first use.


87-91: Watch receiver cloning pattern is correct.

Cloning a tokio::sync::watch::Receiver creates an independent view of the same channel, allowing multiple concurrent watchers.


159-166: Verify behavior: first call may return empty results before daemon fetches data.

The comment on line 165 correctly notes the snapshot "may be empty if daemon hasn't fetched yet." Since metadata_watch() returns immediately after spawning the daemon (without waiting for initial data fetch), the first call to list() could return an empty result even when instances exist.

Is this behavior acceptable for all callers? If callers depend on getting current state from the first list() call, consider either:

  1. Documenting this behavior clearly in the method's doc comments
  2. Having the daemon signal readiness after the initial fetch (if blocking is acceptable)

198-199: Lazy startup integration looks correct for the watch pattern.

Unlike list(), the watch pattern naturally handles the initially empty snapshot because subsequent changes will be delivered via the watch channel as the daemon fetches data.

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 2, 2025
@atchernych atchernych changed the title feat: Start Discover daemon lazily feat: Start Discovery daemon lazily Dec 2, 2025
Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
@atchernych atchernych marked this pull request as ready for review December 2, 2025 22:32
Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants