feat: Start Discovery daemon lazily #4705

atchernych · 2025-12-02T20:43:37Z

Overview:

Start ing Discovery daemon lazily would allow not to run the discovery daemon for prefill worker.

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx

Summary by CodeRabbit

Refactor
- Updated Kubernetes discovery daemon initialization to use a lazy, on-demand approach instead of eager startup. The daemon now starts only when needed, improving application startup performance.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

coderabbitai · 2025-12-02T20:47:17Z

Walkthrough

Refactored Kubernetes discovery daemon from eager startup to lazy, on-demand initialization. Restructured KubeDiscoveryClient with additional fields, replaced derived Clone with manual implementation, introduced DaemonHandles type, and added a metadata_watch() method that spawns the daemon only when first invoked.

Changes

Cohort / File(s)	Summary
Kubernetes Discovery Daemon Initialization `lib/runtime/src/discovery/kube.rs`	Transitioned daemon startup from eager (during initialization) to lazy (on first metadata watch request). Expanded KubeDiscoveryClient with kube_client, pod_info, cancel_token, and daemon_state fields. Removed #[derive(Clone)] and implemented explicit Clone trait. Introduced internal DaemonHandles type to encapsulate watch mechanism. Added async metadata_watch() method that initializes and manages daemon lifecycle via OnceCell. Updated list and list_and_watch methods to lazily invoke metadata_watch() instead of using pre-bound watch channel.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

Thread-safety considerations: Verify OnceCell and Arc usage patterns correctly handle concurrent access and cloning across async boundaries
Error handling in lazy initialization: Ensure metadata_watch() error cases are properly propagated and don't cause silent daemon startup failures
Clone semantics: Confirm manual Clone implementation correctly duplicates all fields, especially Arc and CancellationToken
Caller compatibility: Review impact on existing callers of list/list_and_watch to ensure they handle lazy initialization behavior

Poem

🐰 A daemon sleeps until called upon,
No eager spawning at the break of dawn,
OnceCell keeps watch with gentle care,
Lazy initialization everywhere!
Hop, clone, and share with threads so fleet, ✨

Pre-merge checks

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description check	⚠️ Warning	The PR description follows the required template structure but lacks substantive content in critical sections (Details and Where should the reviewer start), with only placeholder text in Related Issues.	Complete the Details section describing implementation changes, add guidance on where reviewers should focus, and replace the placeholder issue reference with the actual GitHub issue number.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat: Start Discovery daemon lazily' directly and accurately describes the main change: converting eager daemon startup to lazy on-demand initialization.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

lib/runtime/src/discovery/kube.rs (2)
74-85: Manual Clone implementation is correct and ensures shared daemon state.

All clones of KubeDiscoveryClient will share the same daemon_state, meaning the daemon is only started once regardless of how many clones exist. Consider adding a brief doc comment to make this sharing behavior explicit for future maintainers.

340-362: Lazy initialization via get_or_try_init is correctly implemented.

The pattern ensures thread-safe one-time initialization, properly propagates errors from DiscoveryDaemon::new(), and logs the startup event.

One observation: the spawned daemon's JoinHandle is dropped (fire-and-forget pattern). While the daemon uses cancel_token for shutdown signaling, storing the handle in DaemonHandles could enable graceful shutdown waiting or panic detection in the future.
 #[derive(Debug)]
 struct DaemonHandles {
     metadata_watch: tokio::sync::watch::Receiver<Arc<MetadataSnapshot>>,
+    #[allow(dead_code)] // Retained for future graceful shutdown
+    daemon_handle: tokio::task::JoinHandle<()>,
 }
-                tokio::spawn(async move {
+                let daemon_handle = tokio::spawn(async move {
                     if let Err(e) = daemon.run(watch_tx).await {
                         tracing::error!("Discovery daemon failed: {}", e);
                     }
                 });
                 // ... 
                 Ok::<Arc<DaemonHandles>, anyhow::Error>(Arc::new(DaemonHandles {
                     metadata_watch: watch_rx,
+                    daemon_handle,
                 }))

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03070fd and 2c80b06.

📒 Files selected for processing (1)

lib/runtime/src/discovery/kube.rs (5 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-08-15T23:51:04.958Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.

Applied to files:

lib/runtime/src/discovery/kube.rs

📚 Learning: 2025-08-15T23:51:04.958Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, when using `kv_get_and_watch_prefix()` and `dissolve()`, the returned `events_rx` receiver maintains the etcd watch stream independently. The watcher handle can be safely dropped (using `_watcher`) without terminating the stream, as the receiver keeps the connection alive internally. This is a consistent pattern used throughout the codebase in multiple critical modules.

Applied to files:

lib/runtime/src/discovery/kube.rs

📚 Learning: 2025-08-15T23:51:04.958Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, when using `kv_get_and_watch_prefix()` and `dissolve()`, the returned `events_rx` receiver maintains the etcd watch stream independently. The watcher handle can be safely dropped (using `_watcher`) without terminating the stream, as the receiver keeps the connection alive internally.

Applied to files:

lib/runtime/src/discovery/kube.rs

🧬 Code graph analysis (1)

lib/runtime/src/discovery/kube.rs (2)

lib/runtime/src/distributed.rs (1)

new (101-297)

lib/runtime/src/discovery/kube/daemon.rs (1)

new (38-55)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)

GitHub Check: operator (amd64)
GitHub Check: vllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: clippy (.)
GitHub Check: clippy (lib/runtime/examples)
GitHub Check: tests (launch/dynamo-run)
GitHub Check: tests (lib/bindings/python)
GitHub Check: clippy (launch/dynamo-run)
GitHub Check: clippy (lib/bindings/python)
GitHub Check: tests (lib/runtime/examples)
GitHub Check: tests (.)

🔇 Additional comments (5)

lib/runtime/src/discovery/kube.rs (5)

25-37: Struct redesign for lazy initialization looks correct.

The Arc<OnceCell<Arc<DaemonHandles>>> pattern is appropriate: the outer Arc allows sharing the OnceCell across clones of KubeDiscoveryClient, while OnceCell::get_or_try_init ensures thread-safe one-time initialization.

63-71: Constructor correctly defers daemon startup.

The initialization now stores the necessary components (kube_client, pod_info, cancel_token) without starting the daemon, enabling lazy startup on first use.

87-91: Watch receiver cloning pattern is correct.

Cloning a tokio::sync::watch::Receiver creates an independent view of the same channel, allowing multiple concurrent watchers.

159-166: Verify behavior: first call may return empty results before daemon fetches data.

The comment on line 165 correctly notes the snapshot "may be empty if daemon hasn't fetched yet." Since metadata_watch() returns immediately after spawning the daemon (without waiting for initial data fetch), the first call to list() could return an empty result even when instances exist.

Is this behavior acceptable for all callers? If callers depend on getting current state from the first list() call, consider either:

Documenting this behavior clearly in the method's doc comments

Having the daemon signal readiness after the initial fetch (if blocking is acceptable)

198-199: Lazy startup integration looks correct for the watch pattern.

Unlike list(), the watch pattern naturally handles the initially empty snapshot because subsequent changes will be delivered via the watch channel as the daemon fetches data.

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

Start Discover daemon lazily

2c80b06

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

atchernych requested a review from a team as a code owner December 2, 2025 20:43

pull-request-size bot added the size/M label Dec 2, 2025

atchernych marked this pull request as draft December 2, 2025 20:43

github-actions bot added the feat label Dec 2, 2025

coderabbitai bot reviewed Dec 2, 2025

View reviewed changes

cleanup

fdd88d6

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

pull-request-size bot added size/L and removed size/M labels Dec 2, 2025

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 22:28 Failure

atchernych changed the title ~~feat: Start Discover daemon lazily~~ feat: Start Discovery daemon lazily Dec 2, 2025

cleanup

c3bbddd

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 22:32 Failure

atchernych requested a review from mohammedabdulwahhab December 2, 2025 22:32

atchernych marked this pull request as ready for review December 2, 2025 22:32

better comment

6132edd

Signed-off-by: Anna Tchernych <atchernych@nvidia.com>

copy-pr-bot bot had a problem deploying to GITLAB December 2, 2025 22:36 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Start Discovery daemon lazily #4705

feat: Start Discovery daemon lazily #4705

Uh oh!

atchernych commented Dec 2, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Dec 2, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: Start Discovery daemon lazily #4705

Are you sure you want to change the base?

feat: Start Discovery daemon lazily #4705

Uh oh!

Conversation

atchernych commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

atchernych commented Dec 2, 2025 •

edited

Loading

coderabbitai bot commented Dec 2, 2025 •

edited

Loading