-
Notifications
You must be signed in to change notification settings - Fork 131
fix: improve sigterm handling for the entire runtime #3443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers
Are you sure you want to change the base?
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
How to use the Graphite Merge QueueAdd the label merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
More templates
@rivetkit/actor
@rivetkit/cloudflare-workers
@rivetkit/core
@rivetkit/db
@rivetkit/framework-base
@rivetkit/next-js
@rivetkit/react
rivetkit
@rivetkit/sql-loader
@rivetkit/engine-runner
@rivetkit/engine-runner-protocol
commit: |
Pull Request ReviewThis PR improves SIGTERM handling across the runtime with a centralized signal handler and configurable graceful shutdown timeouts. Overall, the architecture is solid and the code quality is good. Here are my findings: ✅ Strengths
🐛 Bugs & IssuesCritical1. Bug in worker shutdown remaining_workflows count ( let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}
2. Potential infinite sleep in guard/worker shutdown loops ( _ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {
Medium3. Missing error propagation in TermSignalHandler::new ( let term_signal = TermSignalHandler::new()
.expect("failed initializing termination signal handler");
4. Inconsistent comment style (
|
660df24 to
7aff577
Compare
Pull Request Review: Improve SIGTERM Handling for the Entire RuntimeSummaryThis PR improves graceful shutdown handling across the Rivet runtime by introducing a centralized Code Quality & Best PracticesStrengths ✅
Issues Found 🔴1. Critical: Incorrect remaining workflows count (gasoline/src/worker.rs:326)let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info\!("all workflows evicted");
} else {
tracing::warn\!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Problem: You're logging Fix: Track remaining count before the loop or use 2. Comment style inconsistency (config/src/config/runtime.rs:11-16)/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.
allow_version_rollback: Option<bool>,Problem: Doc comments should use Fix: Change 3. Potential race condition (service-manager/src/lib.rs:333-343)abort = term_signal.recv() => {
shutting_down.store(true, Ordering::SeqCst);
// Abort services that don't require graceful shutdown
running_services.retain(|(requires_graceful_shutdown, handle)| {
if \!requires_graceful_shutdown {
handle.abort();
}
*requires_graceful_shutdown
});Problem: Between setting Suggestion: Consider setting the flag and aborting in a single atomic operation, or at least document this behavior. 4. Missing error context (runtime/src/term_signal.rs:88)let term_signal = TermSignalHandler::new()
.expect("failed initializing termination signal handler");Problem: Per CLAUDE.md, we should avoid panics when possible and provide better error context. Suggestion: Since this is in an async init context, consider propagating the error or providing more detailed panic message about what might cause this (e.g., "failed initializing termination signal handler - this may indicate the process doesn't have signal handling permissions"). 5. Clippy warning potential (gasoline/src/worker.rs:303)let join_fut = async { while let Some(_) = wf_futs.next().await {} };Issue: Clippy will warn about Fix: Use 6. Shutdown duration calculation issue (gasoline/src/worker.rs:315)_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {Problem: If Better approach: Check if time is already elapsed before the select, or use a deadline-based approach with Security Concerns 🔒Low Risk Issues
Performance Considerations ⚡Good Improvements
Minor Concerns
Test Coverage 📊Concerning Gaps
Recommendations
Additional ObservationsDocumentation
Code RemovalThe removal of Pegboard Actor Change (pegboard/src/workflows/actor/mod.rs:533-567)Moving the "allocate pending actors" logic is a good refactoring - makes the shutdown path cleaner. The comment at line 597 explains the rationale well. RecommendationsMust Fix Before Merge 🔴
Should Fix 🟡
Nice to Have 🟢
Overall AssessmentThis is a solid improvement to the shutdown handling infrastructure. The centralization of signal handling and configurable timeouts are excellent architectural decisions. The main concerns are:
Recommendation: Request changes for the critical bugs, then approve once those are addressed. The testing gaps are concerning but not blocking if this is being monitored in production. Review generated with Claude Code |
7aff577 to
9f71238
Compare
ff49fad to
728f761
Compare
Pull Request Review: Improve SIGTERM Handling for the Entire RuntimeThis PR introduces a centralized termination signal handler and improves graceful shutdown across the runtime. Overall, this is a well-structured improvement to signal handling. Below are my observations and recommendations: ✅ Strengths
🐛 Potential Bugs & IssuesCritical Issues
Moderate Issues
🔒 Security Concerns
⚡ Performance Considerations
🧪 Test Coverage
📝 Code Quality & Best Practices
💡 Recommendations
📊 SummaryOverall Assessment: Good improvement to signal handling infrastructure, but needs bug fixes and test coverage before merging. Blockers:
Recommended Before Merge:
Nice to Have:
Let me know if you'd like me to elaborate on any of these points or if you have questions about the recommendations! |
PR Review: Improve SIGTERM Handling for the Entire RuntimeSummaryThis PR significantly improves graceful shutdown handling across the Rivet runtime. The changes are well-structured and follow good practices for distributed system lifecycle management. Positive Aspects ✅1. Excellent Architecture
2. Well-Implemented Shutdown Logic
3. Code Quality
Issues & Recommendations 🔍Critical Issues1. Bug in worker shutdown (gasoline/worker.rs:322-326) let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}Problem: After the loop, Fix: Track remaining workflows before the iterator is consumed: let remaining_workflows = wf_futs.len();
// ... loop ...
if remaining_workflows == 0 {
tracing::info!("all workflows evicted");
} else {
tracing::warn!(?remaining_workflows, "not all workflows evicted");
}2. Comment inconsistency (config/runtime.rs:11-12, 14-15) /// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,Medium Priority Issues3. Timeout calculation edge case (gasoline/worker.rs:315) _ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {If Consider: Break immediately if timeout already exceeded: let elapsed = shutdown_start.elapsed();
if elapsed >= shutdown_duration {
tracing::warn!("worker shutdown timed out");
break;
}
_ = tokio::time::sleep(shutdown_duration - elapsed) => {4. Missing newline at end of file ( Minor Issues5. Inconsistent import style ( use anyhow::Result;This imports a single type, while the CLAUDE.md emphasizes not to glob import from anyhow. This is good, but consider consistency - other files use 6. Windows signal handling ( Performance Considerations ⚡Positive:
No concerns about performance impact from these changes. Security Considerations 🔒Low Risk: This PR primarily affects shutdown behavior, which has minimal security implications. Positive: Force-close threshold prevents indefinite hangs that could be exploited in DoS scenarios. Test Coverage 🧪Observation: No new tests added for the shutdown behavior. While testing shutdown logic can be complex, consider:
Architecture Considerations 🏗️Question about pegboard changes (
Additional NotesGreat work on:
Documentation: Consider adding a section to docs about:
RecommendationApprove with minor fixes. Please address the critical bug in worker.rs:322-326 and the comment style in runtime.rs. The other issues are lower priority but worth considering for a follow-up PR. Great improvement to the runtime's operational reliability! 🎉 🤖 Generated with Claude Code |

No description provided.