
Conversation

@MasterPtato (Contributor)

No description provided.

@MasterPtato (Contributor, Author) commented Nov 10, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vercel (bot) commented Nov 10, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project               Deployment   Preview   Comments   Updated (UTC)
rivetkit-serverless   Ready        Preview   Comment    Nov 11, 2025 1:45am

3 Skipped Deployments

Project           Deployment   Preview   Comments   Updated (UTC)
rivet-cloud       Ignored      -         -          Nov 11, 2025 1:45am
rivet-inspector   Ignored      Preview   -          Nov 11, 2025 1:45am
rivet-site        Ignored      Preview   -          Nov 11, 2025 1:45am

@pkg-pr-new (bot) commented Nov 10, 2025

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3443

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3443

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3443

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3443

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3443

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3443

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3443

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3443

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3443

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3443

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3443

commit: 9f71238

@claude (bot) commented Nov 10, 2025

Pull Request Review

This PR improves SIGTERM handling across the runtime with a centralized signal handler and configurable graceful shutdown timeouts. Overall, the architecture is solid and the code quality is good. Here are my findings:

✅ Strengths

  1. Excellent architectural improvement: Centralizing signal handling in rivet-runtime eliminates duplication and provides consistent behavior across all components.

  2. Configurable shutdown timeouts: Making shutdown durations configurable (30s for workers, 1h for guard) provides operational flexibility.

  3. Cross-platform support: Clean handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  4. Progressive shutdown: The 3-SIGTERM threshold for force shutdown is a good UX pattern.

  5. Resource cleanup: Proper distinction between services requiring graceful vs immediate shutdown.

🐛 Bugs & Issues

Critical

1. Bug in worker shutdown remaining_workflows count (engine/packages/gasoline/src/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}
  • Line 322 counts wf_futs iterator
  • Line 326 logs self.running_workflows.len() instead of remaining_workflows
  • These could be different values, making the log misleading
  • Fix: Change line 326 to use remaining_workflows variable
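A minimal sketch of that fix, reusing the count already computed on the first line of the snippet:

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(?remaining_workflows, "not all workflows evicted");
}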

2. Potential infinite sleep in guard/worker shutdown loops (engine/packages/guard-core/src/server.rs:269, engine/packages/gasoline/src/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {
  • If shutdown_start.elapsed() >= shutdown_duration, saturating_sub returns Duration::ZERO
  • tokio::time::sleep(Duration::ZERO) completes immediately
  • This causes a tight busy loop until the next SIGTERM or task completion
  • Fix: Break the loop when timeout is reached instead of relying on sleep
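A sketch of that fix under the names used above: check the deadline before sleeping so a zero-length sleep can never spin. drain_with_timeout is a hypothetical wrapper, and the completion and extra-SIGTERM arms of the select! are elided:

async fn drain_with_timeout(shutdown_duration: std::time::Duration) {
    let shutdown_start = tokio::time::Instant::now();
    loop {
        let elapsed = shutdown_start.elapsed();
        if elapsed >= shutdown_duration {
            tracing::warn!("graceful shutdown timed out");
            break;
        }
        tokio::select! {
            // ... completion and extra-SIGTERM arms elided ...
            _ = tokio::time::sleep(shutdown_duration - elapsed) => {}
        }
    }
}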

Medium

3. Missing error propagation in TermSignalHandler::new (engine/packages/runtime/src/term_signal.rs:88)

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");
  • Uses expect() which panics on signal registration failure
  • Should return Result from TermSignal::new() or provide better context
  • Signal registration can fail due to process limits or permissions
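A sketch of the suggested propagation, assuming the enclosing function can return anyhow::Result:

use anyhow::Context;

let term_signal = TermSignalHandler::new()
    .context("failed initializing termination signal handler")?;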

4. Inconsistent comment style (engine/packages/config/src/config/runtime.rs:11-13,15)

  • Lines 11-12 use // comments
  • Should use /// for documentation comments on public struct fields
  • Affects generated documentation quality

⚠️ Potential Issues

5. Service restart behavior during shutdown (engine/packages/service-manager/src/lib.rs:155-166)

  • Services that crash during shutdown still attempt to restart after 1-second delay
  • The shutting_down flag is only set when SIGTERM is received, not when services naturally exit
  • Could cause unwanted restarts if a service crashes before shutdown signal arrives

6. Race condition in TermSignalHandler (engine/packages/runtime/src/term_signal.rs:74-76)

if self.tx.send(self.count >= FORCE_CLOSE_THRESHOLD).is_err() {
    tracing::debug!("no sigterm subscribers");
}
  • If all subscribers drop between signal receipt and send, the signal is silently lost
  • Low impact since signals typically trigger during shutdown when subscribers exist
  • Consider logging at warn level for force-close threshold
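A sketch of the warn-level variant from the last bullet, reusing the names in the snippet above:

let force = self.count >= FORCE_CLOSE_THRESHOLD;
if self.tx.send(force).is_err() {
    if force {
        tracing::warn!("force-close threshold reached but no sigterm subscribers");
    } else {
        tracing::debug!("no sigterm subscribers");
    }
}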

7. AtomicBool SeqCst ordering may be excessive (engine/packages/service-manager/src/lib.rs:333,340)

  • Using SeqCst for simple boolean flag is overly conservative
  • Acquire/Release ordering would suffice and be more performant
  • Low impact in practice
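A sketch of the suggested ordering; for a standalone flag, a Release store paired with an Acquire load is enough (shown with a static for brevity, where the PR uses an Arc<AtomicBool>):

use std::sync::atomic::{AtomicBool, Ordering};

static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

fn signal_shutdown() {
    // Release pairs with the Acquire load below; SeqCst is unnecessary here.
    SHUTTING_DOWN.store(true, Ordering::Release);
}

fn is_shutting_down() -> bool {
    SHUTTING_DOWN.load(Ordering::Acquire)
}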

🎯 Performance Considerations

8. Unnecessary shutdown loop iteration (worker.rs:301-320, server.rs:257-274)

  • Both shutdown loops continue iterating after join_fut/graceful_fut complete
  • Once these futures complete, no need to check timeout or additional SIGTERMs
  • The break on line 307/260 exits correctly, but the loop structure is confusing

9. FuturesUnordered overhead (gasoline/src/worker.rs:294-298)

  • Creates FuturesUnordered for potentially many workflows
  • Only used to check completion, not to process results
  • Consider using futures::future::join_all() or checking running_workflows directly
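A sketch of the join_all alternative; wf_handles is a hypothetical Vec of workflow join handles standing in for the real eviction futures:

use futures::future::join_all;

async fn wait_for_all(wf_handles: Vec<tokio::task::JoinHandle<()>>) {
    // Results are discarded; the only goal is that every workflow finishes.
    let _ = join_all(wf_handles).await;
}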

📝 Style & Convention Issues

10. Hard tabs vs spaces (engine/packages/service-manager/Cargo.toml:19)

  • Line 19 has trailing inconsistency (no newline at EOF)
  • Per CLAUDE.md, Rust uses hard tabs but this is a TOML file
  • Should verify TOML formatting convention

11. Glob imports removed

  • Good catch removing use anyhow::*; in favor of explicit imports
  • Follows CLAUDE.md guidelines

12. Structured logging compliance

  • Properly uses tracing's structured logging throughout (e.g., ?err, ?duration)
  • Follows CLAUDE.md patterns

🧪 Test Coverage

13. No tests for new TermSignal functionality

  • No tests found for engine/packages/runtime/src/term_signal.rs
  • Signal handling is notoriously difficult to test but critical for reliability
  • Consider adding integration tests or manual test procedures

14. No tests for shutdown behavior

  • Worker and guard shutdown logic is complex with multiple edge cases
  • No tests for timeout behavior, force-shutdown, or graceful completion
  • Per CLAUDE.md: "Run tests with output displayed" - but there are no tests to run

🔐 Security Concerns

No security issues identified. Signal handling is appropriate for the threat model.

📋 Recommendations

Must fix before merge:

  1. Fix bug 1 (remaining_workflows count mismatch)
  2. Fix bug 2 (potential busy loop in shutdown timeouts)

Should fix:
3. Fix error propagation in TermSignalHandler initialization
4. Update comment style to use /// for doc comments
5. Add basic tests for signal handling behavior

Nice to have:
6. Simplify shutdown loop logic for clarity
7. Consider relaxed atomic ordering
8. Add integration test documentation for shutdown behavior


Great work on improving the signal handling architecture! The centralized approach is much cleaner than the previous implementation.

@claude (bot) commented Nov 11, 2025

Pull Request Review: Improve SIGTERM Handling for the Entire Runtime

Summary

This PR improves graceful shutdown handling across the Rivet runtime by introducing a centralized TermSignal handler, configurable shutdown timeouts, and better coordination between services during shutdown. Overall, this is a well-thought-out improvement to shutdown reliability.

Code Quality & Best Practices

Strengths ✅

  1. Centralized signal handling: The new TermSignal handler in runtime/src/term_signal.rs provides a clean, reusable abstraction across all services
  2. Configurable timeouts: Adding Runtime config with separate timeouts for worker and guard is excellent for operational flexibility
  3. Cross-platform support: Proper handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals
  4. Graceful degradation: The 3-signal threshold for forced shutdown is a good safety valve
  5. Improved logging: Lowercase log messages follow CLAUDE.md conventions properly

Issues Found 🔴

1. Critical: Incorrect remaining workflows count (gasoline/src/worker.rs:326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: You're logging self.running_workflows.len() but you should log remaining_workflows. Also, into_iter().count() on FuturesUnordered doesn't give you remaining futures - it consumes the iterator and counts items.

Fix: Track remaining count before the loop or use self.running_workflows.len() which hasn't been modified.

2. Comment style inconsistency (config/src/config/runtime.rs:11-16)

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.
allow_version_rollback: Option<bool>,

Problem: Doc comments should use /// consistently, not mix with //

Fix: Change // to /// on lines 12 and 15.

3. Potential race condition (service-manager/src/lib.rs:333-343)

abort = term_signal.recv() => {
    shutting_down.store(true, Ordering::SeqCst);
    
    // Abort services that don't require graceful shutdown
running_services.retain(|(requires_graceful_shutdown, handle)| {
    if !requires_graceful_shutdown {
        handle.abort();
    }
    *requires_graceful_shutdown
});

Problem: Between setting shutting_down and calling abort(), a service might exit gracefully and log "service exited" instead of getting aborted. This is minor but could cause confusion in logs.

Suggestion: Consider setting the flag and aborting in a single atomic operation, or at least document this behavior.

4. Missing error context (runtime/src/term_signal.rs:88)

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");

Problem: Per CLAUDE.md, we should avoid panics when possible and provide better error context.

Suggestion: Since this is in an async init context, consider propagating the error or providing more detailed panic message about what might cause this (e.g., "failed initializing termination signal handler - this may indicate the process doesn't have signal handling permissions").

5. Clippy warning potential (gasoline/src/worker.rs:303)

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

Issue: Clippy will warn about while let Some(_) - the underscore binding is unnecessary.

Fix: Use while wf_futs.next().await.is_some() {} or keep the binding as Some(res) and use it.

6. Shutdown duration calculation issue (gasoline/src/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

Problem: If shutdown_start.elapsed() > shutdown_duration, this will sleep for 0 duration and immediately fire, potentially in a tight loop if the other futures don't complete.

Better approach: Check if time is already elapsed before the select, or use a deadline-based approach with tokio::time::sleep_until.
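A sketch of the deadline-based variant; shutdown_with_deadline and the done future are hypothetical stand-ins for the worker's join_fut, but sleep_until guarantees the timer arm fires exactly once:

use std::time::Duration;
use tokio::time::{sleep_until, Instant};

async fn shutdown_with_deadline(
    shutdown_duration: Duration,
    done: tokio::sync::oneshot::Receiver<()>,
) {
    // Computed once; never re-armed as a zero-length sleep.
    let deadline = Instant::now() + shutdown_duration;
    tokio::select! {
        _ = done => tracing::info!("graceful shutdown complete"),
        _ = sleep_until(deadline) => tracing::warn!("graceful shutdown timed out"),
    }
}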

Security Concerns 🔒

Low Risk Issues

  1. Signal handler abuse: The 3-SIGTERM threshold is good, but a malicious actor could potentially keep a process alive by sending exactly 2 SIGTERMs repeatedly. This is a very minor concern in practice since they could just not send any signals.

  2. No maximum timeout enforcement: The config allows setting arbitrarily long shutdown durations. Consider documenting reasonable limits or adding validation (e.g., max 24 hours).
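A sketch of such validation; MAX_SHUTDOWN_SECS and the helper are hypothetical, with the cap taken from the 24-hour suggestion above:

const MAX_SHUTDOWN_SECS: u32 = 24 * 60 * 60;

fn validate_shutdown_duration(secs: u32) -> anyhow::Result<u32> {
    anyhow::ensure!(secs > 0, "shutdown duration must be non-zero");
    anyhow::ensure!(
        secs <= MAX_SHUTDOWN_SECS,
        "shutdown duration {secs}s exceeds the {MAX_SHUTDOWN_SECS}s cap"
    );
    Ok(secs)
}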

Performance Considerations ⚡

Good Improvements

  1. Removed redundant ctrl_c handling: The old code had both ctrl_c() and signal handlers - good cleanup
  2. Better async coordination: Using FuturesUnordered for concurrent workflow shutdown is more efficient than the old polling approach

Minor Concerns

  1. Tight select loop (service-manager/src/lib.rs:316-352): The main service loop doesn't have any explicit yield points between iterations. While tokio's select handles this well, consider if there's a need for explicit yielding.

  2. Unbounded watch channel (runtime/src/term_signal.rs:40): The watch channel created for shutdown signaling is fine for this use case, but document that this is intentional for broadcast semantics.

Test Coverage 📊

Concerning Gaps

  1. No tests for TermSignal: The new centralized signal handler has no unit tests
  2. No tests for shutdown timeout behavior: Should test that:
    • Worker shutdown respects the timeout
    • Guard shutdown respects the timeout
    • Multiple SIGTERMs trigger abort
  3. Modified test config (engine/tests/common/ctx.rs): Tests now use requires_graceful_shutdown flags but there are no tests validating this behavior

Recommendations

  • Add unit tests for TermSignal::new() and recv() behavior
  • Add integration tests that send actual signals (or mock them) and verify shutdown timing
  • Test the 3-SIGTERM abort threshold (a rough sketch follows this list)
  • Test configuration parsing for the new Runtime config
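A rough Unix-only sketch of the threshold test; it assumes TermSignal::new() returns a Result, that recv() resolves once per signal and returns true once the force-close threshold (3) is reached, and that the libc crate is available:

#[cfg(unix)]
#[tokio::test]
async fn force_close_after_three_sigterms() {
    let mut term = TermSignal::new().expect("failed to install signal handler");
    let mut force = false;
    for _ in 0..3 {
        // Deliver SIGTERM to our own process.
        unsafe { libc::raise(libc::SIGTERM) };
        force = term.recv().await;
    }
    assert!(force, "third SIGTERM should request force shutdown");
}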

Additional Observations

Documentation

  1. Good: Rustdoc-style /// comments on the Runtime struct fields
  2. Missing: No migration guide for users who might have relied on old shutdown behavior
  3. Missing: No explanation of why guard gets 1 hour vs worker gets 30 seconds (presumably because HTTP requests can be long-lived)

Code Removal

The removal of engine/packages/util/src/signal.rs is good - consolidating this logic makes sense. Confirm that no other code depended on this module.

Pegboard Actor Change (pegboard/src/workflows/actor/mod.rs:533-567)

Moving the "allocate pending actors" logic is a good refactoring - makes the shutdown path cleaner. The comment at line 597 explains the rationale well.

Recommendations

Must Fix Before Merge 🔴

  1. Fix the remaining workflows count bug in gasoline worker shutdown
  2. Fix doc comment consistency in runtime.rs
  3. Fix the clippy warning for underscore binding

Should Fix 🟡

  1. Improve shutdown duration timeout handling to avoid potential tight loops
  2. Add basic unit tests for TermSignal
  3. Propagate or better document the expect() in TermSignal initialization

Nice to Have 🟢

  1. Add config validation for maximum shutdown durations
  2. Document the rationale for different default timeouts
  3. Add integration tests for shutdown behavior
  4. Consider the service abort race condition

Overall Assessment

This is a solid improvement to the shutdown handling infrastructure. The centralization of signal handling and configurable timeouts are excellent architectural decisions. The main concerns are:

  • Testing coverage is inadequate for such critical infrastructure
  • A few minor bugs that should be fixed before merge
  • Some edge cases around timing could be handled more robustly

Recommendation: Request changes for the critical bugs, then approve once those are addressed. The testing gaps are concerning but not blocking if this is being monitored in production.


Review generated with Claude Code

@MasterPtato force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 7aff577 to 9f71238 on November 11, 2025 at 01:44
@MasterPtato force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from ff49fad to 728f761 on November 11, 2025 at 01:44
@claude (bot) commented Nov 11, 2025

Pull Request Review: Improve SIGTERM Handling for the Entire Runtime

This PR introduces a centralized termination signal handler and improves graceful shutdown across the runtime. Overall, this is a well-structured improvement to signal handling. Below are my observations and recommendations:


Strengths

  1. Centralized Signal Handling: The new TermSignal handler in rivet-runtime provides a singleton pattern for signal management across the entire runtime, avoiding duplicate signal handlers.

  2. Configurable Shutdown Timeouts: The new Runtime config allows operators to customize shutdown durations for different components (worker, guard), which is excellent for production flexibility.

  3. Graceful Degradation: The multi-signal handling (counting up to 3 SIGTERMs before forcing shutdown) provides a good balance between graceful and forced termination.

  4. Cross-platform Support: Proper handling of Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  5. Better Logging: Improved structured logging throughout the shutdown process makes debugging much easier.


🐛 Potential Bugs & Issues

Critical Issues

  1. Bug in gasoline/worker.rs:326 - Incorrect variable reference:

    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

    This should use remaining_workflows (calculated on line 322) instead of self.running_workflows.len() which may be stale. The workflows map wasn't cleaned up before this log.

  2. Comment Inconsistency in config/runtime.rs: Lines 12-16 use // instead of /// for doc comments on public struct fields:

    // to 1 hour.  // ← Should be ///
    guard_shutdown_duration: Option<u32>,
    /// Whether or not to allow running the engine when the previous version that was run is higher than
    // the current version.  // ← Should be ///
    allow_version_rollback: Option<bool>,

Moderate Issues

  1. Potential Race Condition in service-manager/lib.rs: The shutting_down AtomicBool is set (line 333), then services are aborted (lines 336-342), but services check the flag in their loops. There's a window where a service might restart before being aborted. Consider aborting services first, then setting the flag, or use a more explicit coordination mechanism.

  2. FuturesUnordered Not Consumed Properly in gasoline/worker.rs:303: The pattern creates a future that waits for all workflows:

    let join_fut = async { while let Some(_) = wf_futs.next().await {} };

    However, this future is used in a tokio::select! that may break early on timeout or abort. After the select, line 322 tries to count remaining workflows by consuming wf_futs, but if join_fut was selected and completed, wf_futs would be empty. The count would be correct, but the logic is confusing.

  3. Unbounded Retry in Service Manager: Services that crash will retry indefinitely (lines 161-183). While shutdown detection exists, during normal operation a failing service could spam logs and consume resources. Consider a circuit breaker or exponential backoff.


🔒 Security Concerns

  1. Panic in Critical Path (term_signal.rs:88):

    .expect("failed initializing termination signal handler")

    While unlikely, if signal handler initialization fails, the process panics. Consider returning a Result to the caller and handling initialization failures more gracefully at the application level.

  2. No Validation on Shutdown Durations: The worker_shutdown_duration and guard_shutdown_duration accept any u32 value. Consider adding validation or warnings for unreasonable values (e.g., 0 seconds or years).


Performance Considerations

  1. Cloning in service-manager: Each service clones config and pools on every restart (lines 162, 199, 235, 279). While Rust's Arc-based cloning is cheap, ensure these types are indeed cheap to clone.

  2. Busy Loop in Guard Shutdown (guard-core/server.rs:257-274): The shutdown loop continuously polls the term signal and timeout. While not critical, this could be restructured to avoid unnecessary CPU cycles.


🧪 Test Coverage

  1. No Tests for New Code: The new TermSignal handler and shutdown logic lack unit tests. Consider adding:

    • Tests for signal counting and threshold behavior
    • Tests for timeout behavior in worker shutdown
    • Integration tests for graceful shutdown scenarios
    • Tests for the requires_graceful_shutdown flag behavior
  2. No Tests for Configuration: The new Runtime config struct and its accessor methods lack tests for default values and edge cases.


📝 Code Quality & Best Practices

  1. Inconsistent Error Handling: Some places use anyhow::Result with glob imports (anyhow::*), others import explicitly. Per CLAUDE.md: "Do not glob import (::*) from anyhow". The guard-core changes correctly import anyhow::Result explicitly, but ensure consistency across all modified files.

  2. Magic Numbers: FORCE_CLOSE_THRESHOLD = 3 in term_signal.rs:14 could be configurable or at least documented why 3 is the right number.

  3. Unused Variable Pattern (service-manager/lib.rs:324):

    while let Some(_) = handle_futs.next().await {}

    Consider while handle_futs.next().await.is_some() {} for clarity, though both are idiomatic.

  4. Documentation Missing: The new TermSignal::recv() method documents its return value well, but TermSignal::new() and TermSignal::stop() lack doc comments.


💡 Recommendations

  1. Fix the critical bug on gasoline/worker.rs:326
  2. Fix doc comment formatting in config/runtime.rs
  3. Add validation for shutdown duration config values
  4. Add unit tests for TermSignal behavior, especially the force-shutdown threshold
  5. Add integration tests for graceful shutdown scenarios
  6. Consider exponential backoff for service restarts
  7. Document the magic number (3) for force shutdown threshold
  8. Add doc comments to all public APIs

📊 Summary

Overall Assessment: Good improvement to signal handling infrastructure, but needs bug fixes and test coverage before merging.

Blockers:

  • Bug in worker.rs:326 (incorrect variable)
  • Doc comment formatting in runtime.rs

Recommended Before Merge:

  • Add tests for TermSignal
  • Add tests for shutdown behavior
  • Validate shutdown duration configs

Nice to Have:

  • Exponential backoff for service restarts
  • More comprehensive documentation
  • Circuit breaker for failing services

Let me know if you'd like me to elaborate on any of these points or if you have questions about the recommendations!

@claude (bot) commented Nov 11, 2025

PR Review: Improve SIGTERM Handling for the Entire Runtime

Summary

This PR significantly improves graceful shutdown handling across the Rivet runtime. The changes are well-structured and follow good practices for distributed system lifecycle management.

Positive Aspects ✅

1. Excellent Architecture

  • Centralized signal handling: The new TermSignal implementation in rivet-runtime provides a clean, reusable abstraction for signal handling across all services
  • Configurable timeouts: Moving shutdown durations to configuration is a great improvement for operational flexibility
  • Graceful degradation: The force-close threshold (3 SIGTERMs) provides a good escape hatch

2. Well-Implemented Shutdown Logic

  • Worker shutdown (gasoline/worker.rs:270-330): Clean implementation using FuturesUnordered to track workflow completion
  • Guard shutdown (guard-core/server.rs:252-278): Proper handling of pending HTTP requests with configurable timeout
  • Service manager (service-manager/lib.rs:316-353): Smart differentiation between services requiring graceful shutdown vs. immediate abort

3. Code Quality

  • Follows repository conventions (structured logging with ? syntax, lowercase messages)
  • Good use of saturating_sub to prevent underflow in timeout calculations
  • Proper atomic boolean usage for coordinating shutdown state

Issues & Recommendations 🔍

Critical Issues

1. Bug in worker shutdown (gasoline/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: After the loop, wf_futs has been consumed by wf_futs.next().await, so into_iter().count() always returns 0. Additionally, the warning message uses self.running_workflows.len() instead of remaining_workflows.

Fix: Read the pending count from wf_futs.len() after the shutdown loop exits; len() does not consume the set, and completed futures are removed from it as they finish:

// ... shutdown loop ...
let remaining_workflows = wf_futs.len();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(?remaining_workflows, "not all workflows evicted");
}

2. Comment inconsistency (config/runtime.rs:11-12, 14-15)
Comments use // instead of /// for doc comments. Should be:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,

Medium Priority Issues

3. Timeout calculation edge case (gasoline/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

If shutdown_start.elapsed() exceeds shutdown_duration, this creates a zero-duration sleep that will fire immediately on every loop iteration. While functionally correct (will break on next iteration), it's inefficient.

Consider: Break immediately if timeout already exceeded:

loop {
    let elapsed = shutdown_start.elapsed();
    if elapsed >= shutdown_duration {
        tracing::warn!("worker shutdown timed out");
        break;
    }
    tokio::select! {
        // ... existing arms elided ...
        _ = tokio::time::sleep(shutdown_duration - elapsed) => {}
    }
}

4. Missing newline at end of file (service-manager/Cargo.toml)
The file ends without a newline after tracing.workspace = true - this violates POSIX standards.

Minor Issues

5. Inconsistent import style (guard-core/server.rs:10)

use anyhow::Result;

This imports a single type, while the CLAUDE.md emphasizes not to glob import from anyhow. This is good, but consider consistency - other files use anyhow::* patterns that should be avoided per the style guide.

6. Windows signal handling (runtime/term_signal.rs:61-64)
The Windows implementation only handles Ctrl+C, not other termination signals. This is likely fine for the use case, but worth documenting that SIGTERM-equivalent behavior on Windows is limited.
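For reference, a minimal sketch of the Windows-only path via tokio; ctrl_c is the closest analogue to SIGTERM there:

#[cfg(windows)]
async fn wait_for_termination() {
    // Windows delivers CTRL_C_EVENT rather than POSIX signals;
    // tokio exposes it through signal::ctrl_c.
    tokio::signal::ctrl_c()
        .await
        .expect("failed to listen for ctrl-c");
}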

Performance Considerations ⚡

Positive:

  • Using FuturesUnordered for concurrent task management is efficient
  • Atomic bool for shutdown signaling has minimal overhead
  • Graceful shutdown prevents dropped connections and lost work

No concerns about performance impact from these changes.

Security Considerations 🔒

Low Risk: This PR primarily affects shutdown behavior, which has minimal security implications.

Positive: Force-close threshold prevents indefinite hangs that could be exploited in DoS scenarios.

Test Coverage 🧪

Observation: No new tests added for the shutdown behavior. While testing shutdown logic can be complex, consider:

  • Unit tests for TermSignal to verify signal counting and force-close threshold
  • Integration tests that verify graceful shutdown actually waits for running workflows/requests
  • Tests for the timeout behavior

Architecture Considerations 🏗️

Question about pegboard changes (pegboard/workflows/actor/mod.rs:533-547):
The removal of the AllocatePendingActors logic from the actor cleanup path seems unrelated to SIGTERM handling. The comment at line 536-537 mentions "the actor is not allocated" - but this appears to be a behavioral change beyond shutdown improvements. Can you clarify if this is:

  1. A bug fix that was discovered during shutdown testing?
  2. An optimization to move this logic elsewhere?
  3. Related to shutdown behavior in a non-obvious way?

Additional Notes

Great work on:

  • Clean separation of concerns (signal handling, configuration, shutdown logic)
  • Making timeouts configurable for different operational needs
  • Comprehensive handling across all service types (worker, guard, cron, oneshot)

Documentation: Consider adding a section to docs about:

  • The new runtime configuration options
  • Expected shutdown behavior and timeouts
  • What happens when force-close threshold is reached

Recommendation

Approve with minor fixes. Please address the critical bug in worker.rs:322-326 and the comment style in runtime.rs. The other issues are lower priority but worth considering for a follow-up PR.

Great improvement to the runtime's operational reliability! 🎉


🤖 Generated with Claude Code
