feat: Start Discovery daemon lazily #4705
base: main
Conversation
Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
Walkthrough
Refactored the Kubernetes discovery daemon from eager startup to lazy, on-demand initialization. Restructured KubeDiscoveryClient with additional fields, replaced the derived Clone with a manual implementation, introduced a DaemonHandles type, and added a metadata_watch() method that spawns the daemon only when first invoked.
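As orientation before the detailed comments, here is a minimal, self-contained sketch of that lazy-start shape. It uses stand-in names (LazyClient, a u64 payload instead of Arc<MetadataSnapshot>, a sleep loop instead of DiscoveryDaemon::run) and is not the actual implementation:

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::{watch, OnceCell};

// Stand-in for the real DaemonHandles in kube.rs.
#[derive(Debug)]
struct DaemonHandles {
    metadata_watch: watch::Receiver<u64>, // the real code watches an Arc<MetadataSnapshot>
}

#[derive(Clone)]
struct LazyClient {
    // Shared across clones so the background daemon is spawned at most once.
    daemon_state: Arc<OnceCell<Arc<DaemonHandles>>>,
}

impl LazyClient {
    fn new() -> Self {
        Self { daemon_state: Arc::new(OnceCell::new()) }
    }

    // The first caller spawns the daemon; later callers just clone the receiver.
    async fn metadata_watch(&self) -> anyhow::Result<watch::Receiver<u64>> {
        let handles = self
            .daemon_state
            .get_or_try_init(|| async {
                let (tx, rx) = watch::channel(0u64);
                tokio::spawn(async move {
                    // Stand-in for DiscoveryDaemon::run(watch_tx): publish updates forever.
                    let mut version = 0u64;
                    loop {
                        version += 1;
                        if tx.send(version).is_err() {
                            break; // every receiver has been dropped
                        }
                        tokio::time::sleep(Duration::from_millis(100)).await;
                    }
                });
                Ok::<_, anyhow::Error>(Arc::new(DaemonHandles { metadata_watch: rx }))
            })
            .await?;
        Ok(handles.metadata_watch.clone())
    }
}
```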
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20–25 minutes
Pre-merge checks
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
lib/runtime/src/discovery/kube.rs (2)
74-85: Manual Clone implementation is correct and ensures shared daemon state.
All clones of KubeDiscoveryClient will share the same daemon_state, meaning the daemon is only started once regardless of how many clones exist. Consider adding a brief doc comment to make this sharing behavior explicit for future maintainers.
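Such a doc comment could sit on a shape like the following sketch; only the daemon_state field mirrors the PR, the other names are placeholders rather than the real kube.rs definitions:

```rust
use std::sync::Arc;
use tokio::sync::OnceCell;

// Placeholder for the real DaemonHandles defined in kube.rs.
struct DaemonHandles;

struct DiscoveryClientSketch {
    namespace: String, // illustrative field
    /// Shared lazily-started daemon state; all clones point at the same cell.
    daemon_state: Arc<OnceCell<Arc<DaemonHandles>>>,
}

impl Clone for DiscoveryClientSketch {
    fn clone(&self) -> Self {
        Self {
            namespace: self.namespace.clone(),
            // Arc::clone keeps a single OnceCell across every clone, so whichever
            // clone triggers initialization first starts the daemon for all of them.
            daemon_state: Arc::clone(&self.daemon_state),
        }
    }
}
```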
340-362: Lazy initialization via get_or_try_init is correctly implemented.
The pattern ensures thread-safe one-time initialization, properly propagates errors from DiscoveryDaemon::new(), and logs the startup event.
One observation: the spawned daemon's JoinHandle is dropped (fire-and-forget pattern). While the daemon uses cancel_token for shutdown signaling, storing the handle in DaemonHandles could enable graceful shutdown waiting or panic detection in the future.

```diff
 #[derive(Debug)]
 struct DaemonHandles {
     metadata_watch: tokio::sync::watch::Receiver<Arc<MetadataSnapshot>>,
+    #[allow(dead_code)] // Retained for future graceful shutdown
+    daemon_handle: tokio::task::JoinHandle<()>,
 }
```

```diff
-    tokio::spawn(async move {
+    let daemon_handle = tokio::spawn(async move {
         if let Err(e) = daemon.run(watch_tx).await {
             tracing::error!("Discovery daemon failed: {}", e);
         }
     });
     // ...
     Ok::<Arc<DaemonHandles>, anyhow::Error>(Arc::new(DaemonHandles {
         metadata_watch: watch_rx,
+        daemon_handle,
     }))
```
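For context on what retaining the handle buys, a hedged shutdown sketch (not part of this PR; it assumes the JoinHandle has been moved out of DaemonHandles and that cancel_token is a tokio_util CancellationToken, as referenced elsewhere in the review):

```rust
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

// Hypothetical graceful-shutdown helper illustrating why keeping the JoinHandle
// around can be useful; none of this exists in the PR.
async fn shutdown_daemon(daemon_handle: JoinHandle<()>, cancel_token: CancellationToken) {
    // Signal the daemon loop to exit...
    cancel_token.cancel();
    // ...then wait for the task itself, which also surfaces a panic as a JoinError.
    if let Err(join_err) = daemon_handle.await {
        eprintln!("discovery daemon task did not shut down cleanly: {join_err}");
    }
}
```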
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/runtime/src/discovery/kube.rs (5 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, `PrefixWatcher` uses `#[derive(Dissolve)]` to generate a `dissolve()` method. The pattern `let (_, _watcher, mut events_rx) = prefix_watcher.dissolve();` is the standard and intended usage throughout the codebase. The `mpsc::Receiver<WatchEvent>` maintains the etcd watch stream independently, so the `Watcher` handle can be safely dropped. This pattern is used consistently in critical infrastructure modules like component/client.rs, utils/leader_worker_barrier.rs, and entrypoint/input/http.rs.
Applied to files:
lib/runtime/src/discovery/kube.rs
📚 Learning: 2025-08-15T23:51:04.958Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2465
File: lib/runtime/src/utils/typed_prefix_watcher.rs:94-101
Timestamp: 2025-08-15T23:51:04.958Z
Learning: In the dynamo codebase's etcd client implementation, when using `kv_get_and_watch_prefix()` and `dissolve()`, the returned `events_rx` receiver maintains the etcd watch stream independently. The watcher handle can be safely dropped (using `_watcher`) without terminating the stream, as the receiver keeps the connection alive internally. This is a consistent pattern used throughout the codebase in multiple critical modules.
Applied to files:
lib/runtime/src/discovery/kube.rs
🧬 Code graph analysis (1)
lib/runtime/src/discovery/kube.rs (2)
lib/runtime/src/distributed.rs (1)
new (101-297)
lib/runtime/src/discovery/kube/daemon.rs (1)
new (38-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: clippy (.)
- GitHub Check: clippy (lib/runtime/examples)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: tests (.)
🔇 Additional comments (5)
lib/runtime/src/discovery/kube.rs (5)
25-37: Struct redesign for lazy initialization looks correct.
The Arc<OnceCell<Arc<DaemonHandles>>> pattern is appropriate: the outer Arc allows sharing the OnceCell across clones of KubeDiscoveryClient, while OnceCell::get_or_try_init ensures thread-safe one-time initialization.
63-71: Constructor correctly defers daemon startup.
The initialization now stores the necessary components (kube_client, pod_info, cancel_token) without starting the daemon, enabling lazy startup on first use.
87-91: Watch receiver cloning pattern is correct.
Cloning a tokio::sync::watch::Receiver creates an independent view of the same channel, allowing multiple concurrent watchers.
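A quick generic illustration of that receiver behavior (plain tokio, unrelated to the kube types):

```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel("initial");
    let mut rx_a = rx.clone();
    let mut rx_b = rx; // each receiver keeps its own "seen" cursor

    tx.send("updated").unwrap();

    // Both receivers independently observe the same new value.
    rx_a.changed().await.unwrap();
    rx_b.changed().await.unwrap();
    assert_eq!(*rx_a.borrow(), "updated");
    assert_eq!(*rx_b.borrow(), "updated");
}
```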
159-166: Verify behavior: first call may return empty results before daemon fetches data.
The comment on line 165 correctly notes the snapshot "may be empty if daemon hasn't fetched yet." Since metadata_watch() returns immediately after spawning the daemon (without waiting for the initial data fetch), the first call to list() could return an empty result even when instances exist.
Is this behavior acceptable for all callers? If callers depend on getting current state from the first list() call, consider either:
- Documenting this behavior clearly in the method's doc comments
- Having the daemon signal readiness after the initial fetch, if blocking is acceptable (see the sketch after this list)
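If the second option were pursued, one way to express it, sketched against a plain watch channel rather than the real MetadataSnapshot (the fetched_once flag is an assumption, not an existing field):

```rust
use tokio::sync::watch;

// Illustrative snapshot type; the real channel carries Arc<MetadataSnapshot>.
#[derive(Clone, Default)]
struct Snapshot {
    fetched_once: bool, // hypothetical flag the daemon would set after its first fetch
}

// Block the caller until the daemon has completed at least one fetch.
async fn wait_until_ready(rx: &mut watch::Receiver<Snapshot>) -> anyhow::Result<Snapshot> {
    let ready = rx.wait_for(|s| s.fetched_once).await?;
    Ok((*ready).clone())
}
```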
198-199: Lazy startup integration looks correct for the watch pattern.
Unlike list(), the watch pattern naturally handles the initially empty snapshot because subsequent changes will be delivered via the watch channel as the daemon fetches data.
Signed-off-by: Anna Tchernych <atchernych@nvidia.com>
Overview:
Starting the Discovery daemon lazily allows the discovery daemon to be skipped entirely for prefill workers.
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit