
Conversation

@MasterPtato (Contributor)

No description provided.

@MasterPtato (Contributor, Author) commented Nov 10, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@vercel (bot) commented Nov 10, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project               Deployment   Preview   Comments   Updated (UTC)
rivetkit-serverless   Ready        Preview   Comment    Nov 11, 2025 1:45am

3 Skipped Deployments

Project           Deployment   Preview   Comments   Updated (UTC)
rivet-cloud       Ignored      -         -          Nov 11, 2025 1:45am
rivet-inspector   Ignored      Preview   -          Nov 11, 2025 1:45am
rivet-site        Ignored      Preview   -          Nov 11, 2025 1:45am

@pkg-pr-new (bot) commented Nov 10, 2025

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3443

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3443

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3443

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3443

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3443

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3443

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3443

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3443

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3443

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3443

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3443

commit: 9f71238

@claude (bot) commented Nov 10, 2025

Pull Request Review

This PR improves SIGTERM handling across the runtime with a centralized signal handler and configurable graceful shutdown timeouts. Overall, the architecture is solid and the code quality is good. Here are my findings:

✅ Strengths

  1. Excellent architectural improvement: Centralizing signal handling in rivet-runtime eliminates duplication and provides consistent behavior across all components.

  2. Configurable shutdown timeouts: Making shutdown durations configurable (30s for workers, 1h for guard) provides operational flexibility.

  3. Cross-platform support: Clean handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  4. Progressive shutdown: The 3-SIGTERM threshold for force shutdown is a good UX pattern.

  5. Resource cleanup: Proper distinction between services requiring graceful vs immediate shutdown.

🐛 Bugs & Issues

Critical

1. Bug in worker shutdown remaining_workflows count (engine/packages/gasoline/src/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}
  • Line 322 counts wf_futs iterator
  • Line 326 logs self.running_workflows.len() instead of remaining_workflows
  • These could be different values, making the log misleading
  • Fix: Change line 326 to use remaining_workflows variable
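A minimal sketch of that fix, reusing the count already computed on the first line of the snippet:

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(?remaining_workflows, "not all workflows evicted");
}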

2. Potential infinite sleep in guard/worker shutdown loops (engine/packages/guard-core/src/server.rs:269, engine/packages/gasoline/src/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {
  • If shutdown_start.elapsed() >= shutdown_duration, saturating_sub returns Duration::ZERO
  • tokio::time::sleep(Duration::ZERO) completes immediately
  • This causes a tight busy loop until the next SIGTERM or task completion
  • Fix: Break the loop when timeout is reached instead of relying on sleep
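A sketch of that fix under the names used above: check the deadline before sleeping so a zero-length sleep can never spin. drain_with_timeout is a hypothetical wrapper, and the completion and extra-SIGTERM arms of the select! are elided:

async fn drain_with_timeout(shutdown_duration: std::time::Duration) {
    let shutdown_start = tokio::time::Instant::now();
    loop {
        let elapsed = shutdown_start.elapsed();
        if elapsed >= shutdown_duration {
            tracing::warn!("graceful shutdown timed out");
            break;
        }
        tokio::select! {
            // ... completion and extra-SIGTERM arms elided ...
            _ = tokio::time::sleep(shutdown_duration - elapsed) => {}
        }
    }
}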

Medium

3. Missing error propagation in TermSignalHandler::new (engine/packages/runtime/src/term_signal.rs:88)

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");
  • Uses expect() which panics on signal registration failure
  • Should return Result from TermSignal::new() or provide better context
  • Signal registration can fail due to process limits or permissions
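A sketch of the suggested propagation, assuming the enclosing function can return anyhow::Result:

use anyhow::Context;

let term_signal = TermSignalHandler::new()
    .context("failed initializing termination signal handler")?;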

4. Inconsistent comment style (engine/packages/config/src/config/runtime.rs:11-13,15)

  • Lines 11-12 use // comments
  • Should use /// for documentation comments on public struct fields
  • Affects generated documentation quality

⚠️ Potential Issues

5. Service restart behavior during shutdown (engine/packages/service-manager/src/lib.rs:155-166)

  • Services that crash during shutdown still attempt to restart after 1-second delay
  • The shutting_down flag is only set when SIGTERM is received, not when services naturally exit
  • Could cause unwanted restarts if a service crashes before shutdown signal arrives

6. Race condition in TermSignalHandler (engine/packages/runtime/src/term_signal.rs:74-76)

if self.tx.send(self.count >= FORCE_CLOSE_THRESHOLD).is_err() {
    tracing::debug!("no sigterm subscribers");
}
  • If all subscribers drop between signal receipt and send, the signal is silently lost
  • Low impact since signals typically trigger during shutdown when subscribers exist
  • Consider logging at warn level for force-close threshold
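A sketch of the warn-level variant from the last bullet, reusing the names in the snippet above:

let force = self.count >= FORCE_CLOSE_THRESHOLD;
if self.tx.send(force).is_err() {
    if force {
        tracing::warn!("force-close threshold reached but no sigterm subscribers");
    } else {
        tracing::debug!("no sigterm subscribers");
    }
}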

7. AtomicBool SeqCst ordering may be excessive (engine/packages/service-manager/src/lib.rs:333,340)

  • Using SeqCst for simple boolean flag is overly conservative
  • Acquire/Release ordering would suffice and be more performant
  • Low impact in practice
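A sketch of the suggested ordering; for a standalone flag, a Release store paired with an Acquire load is enough (shown with a static for brevity, where the PR uses an Arc<AtomicBool>):

use std::sync::atomic::{AtomicBool, Ordering};

static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

fn signal_shutdown() {
    // Release pairs with the Acquire load below; SeqCst is unnecessary here.
    SHUTTING_DOWN.store(true, Ordering::Release);
}

fn is_shutting_down() -> bool {
    SHUTTING_DOWN.load(Ordering::Acquire)
}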

🎯 Performance Considerations

8. Unnecessary shutdown loop iteration (worker.rs:301-320, server.rs:257-274)

  • Both shutdown loops continue iterating after join_fut/graceful_fut complete
  • Once these futures complete, no need to check timeout or additional SIGTERMs
  • The break on line 307/260 exits correctly, but the loop structure is confusing

9. FuturesUnordered overhead (gasoline/src/worker.rs:294-298)

  • Creates FuturesUnordered for potentially many workflows
  • Only used to check completion, not to process results
  • Consider using futures::future::join_all() or checking running_workflows directly
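A sketch of the join_all alternative; wf_handles is a hypothetical Vec of workflow join handles standing in for the real eviction futures:

use futures::future::join_all;

async fn wait_for_all(wf_handles: Vec<tokio::task::JoinHandle<()>>) {
    // Results are discarded; the only goal is that every workflow finishes.
    let _ = join_all(wf_handles).await;
}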

📝 Style & Convention Issues

10. Hard tabs vs spaces (engine/packages/service-manager/Cargo.toml:19)

  • Line 19 has trailing inconsistency (no newline at EOF)
  • Per CLAUDE.md, Rust uses hard tabs but this is a TOML file
  • Should verify TOML formatting convention

11. Glob imports removed

  • Good catch removing use anyhow::*; in favor of explicit imports
  • Follows CLAUDE.md guidelines

12. Structured logging compliance

  • Properly uses tracing's structured logging throughout (e.g., ?err, ?duration)
  • Follows CLAUDE.md patterns

🧪 Test Coverage

13. No tests for new TermSignal functionality

  • No tests found for engine/packages/runtime/src/term_signal.rs
  • Signal handling is notoriously difficult to test but critical for reliability
  • Consider adding integration tests or manual test procedures

14. No tests for shutdown behavior

  • Worker and guard shutdown logic is complex with multiple edge cases
  • No tests for timeout behavior, force-shutdown, or graceful completion
  • Per CLAUDE.md: "Run tests with output displayed" - but there are no tests to run

🔐 Security Concerns

No security issues identified. Signal handling is appropriate for the threat model.

📋 Recommendations

Must fix before merge:

  1. Fix bug 1 (remaining_workflows count mismatch)
  2. Fix bug 2 (potential busy loop in shutdown timeouts)

Should fix:
3. Fix error propagation in TermSignalHandler initialization
4. Update comment style to use /// for doc comments
5. Add basic tests for signal handling behavior

Nice to have:
6. Simplify shutdown loop logic for clarity
7. Consider relaxed atomic ordering
8. Add integration test documentation for shutdown behavior


Great work on improving the signal handling architecture! The centralized approach is much cleaner than the previous implementation.

@claude (bot) commented Nov 11, 2025

Pull Request Review: Improve SIGTERM Handling for the Entire Runtime

Summary

This PR improves graceful shutdown handling across the Rivet runtime by introducing a centralized TermSignal handler, configurable shutdown timeouts, and better coordination between services during shutdown. Overall, this is a well-thought-out improvement to shutdown reliability.

Code Quality & Best Practices

Strengths ✅

  1. Centralized signal handling: The new TermSignal handler in runtime/src/term_signal.rs provides a clean, reusable abstraction across all services
  2. Configurable timeouts: Adding Runtime config with separate timeouts for worker and guard is excellent for operational flexibility
  3. Cross-platform support: Proper handling of both Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals
  4. Graceful degradation: The 3-signal threshold for forced shutdown is a good safety valve
  5. Improved logging: Lowercase log messages follow CLAUDE.md conventions properly

Issues Found 🔴

1. Critical: Incorrect remaining workflows count (gasoline/src/worker.rs:326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: You're logging self.running_workflows.len() but you should log remaining_workflows. Also, into_iter().count() on FuturesUnordered doesn't give you remaining futures - it consumes the iterator and counts items.

Fix: Track remaining count before the loop or use self.running_workflows.len() which hasn't been modified.

2. Comment style inconsistency (config/src/config/runtime.rs:11-16)

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
// the current version.
allow_version_rollback: Option<bool>,

Problem: Doc comments should use /// consistently, not mix with //

Fix: Change // to /// on lines 12 and 15.

3. Potential race condition (service-manager/src/lib.rs:333-343)

abort = term_signal.recv() => {
    shutting_down.store(true, Ordering::SeqCst);
    
    // Abort services that don't require graceful shutdown
running_services.retain(|(requires_graceful_shutdown, handle)| {
    if !requires_graceful_shutdown {
        handle.abort();
    }
    *requires_graceful_shutdown
});

Problem: Between setting shutting_down and calling abort(), a service might exit gracefully and log "service exited" instead of getting aborted. This is minor but could cause confusion in logs.

Suggestion: Consider setting the flag and aborting in a single atomic operation, or at least document this behavior.

4. Missing error context (runtime/src/term_signal.rs:88)

let term_signal = TermSignalHandler::new()
    .expect("failed initializing termination signal handler");

Problem: Per CLAUDE.md, we should avoid panics when possible and provide better error context.

Suggestion: Since this is in an async init context, consider propagating the error or providing more detailed panic message about what might cause this (e.g., "failed initializing termination signal handler - this may indicate the process doesn't have signal handling permissions").

5. Clippy warning potential (gasoline/src/worker.rs:303)

let join_fut = async { while let Some(_) = wf_futs.next().await {} };

Issue: Clippy will warn about while let Some(_) - the underscore binding is unnecessary.

Fix: Use while wf_futs.next().await.is_some() {} or keep the binding as Some(res) and use it.

6. Shutdown duration calculation issue (gasoline/src/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

Problem: If shutdown_start.elapsed() > shutdown_duration, this will sleep for 0 duration and immediately fire, potentially in a tight loop if the other futures don't complete.

Better approach: Check if time is already elapsed before the select, or use a deadline-based approach with tokio::time::sleep_until.
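A sketch of the deadline-based variant; shutdown_with_deadline and the done future are hypothetical stand-ins for the worker's join_fut, but sleep_until guarantees the timer arm fires exactly once:

use std::time::Duration;
use tokio::time::{sleep_until, Instant};

async fn shutdown_with_deadline(
    shutdown_duration: Duration,
    done: tokio::sync::oneshot::Receiver<()>,
) {
    // Computed once; never re-armed as a zero-length sleep.
    let deadline = Instant::now() + shutdown_duration;
    tokio::select! {
        _ = done => tracing::info!("graceful shutdown complete"),
        _ = sleep_until(deadline) => tracing::warn!("graceful shutdown timed out"),
    }
}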

Security Concerns 🔒

Low Risk Issues

  1. Signal handler abuse: The 3-SIGTERM threshold is good, but a malicious actor could potentially keep a process alive by sending exactly 2 SIGTERMs repeatedly. This is a very minor concern in practice since they could just not send any signals.

  2. No maximum timeout enforcement: The config allows setting arbitrarily long shutdown durations. Consider documenting reasonable limits or adding validation (e.g., max 24 hours).
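A sketch of such validation; MAX_SHUTDOWN_SECS and the helper are hypothetical, with the cap taken from the 24-hour suggestion above:

const MAX_SHUTDOWN_SECS: u32 = 24 * 60 * 60;

fn validate_shutdown_duration(secs: u32) -> anyhow::Result<u32> {
    anyhow::ensure!(secs > 0, "shutdown duration must be non-zero");
    anyhow::ensure!(
        secs <= MAX_SHUTDOWN_SECS,
        "shutdown duration {secs}s exceeds the {MAX_SHUTDOWN_SECS}s cap"
    );
    Ok(secs)
}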

Performance Considerations ⚡

Good Improvements

  1. Removed redundant ctrl_c handling: The old code had both ctrl_c() and signal handlers - good cleanup
  2. Better async coordination: Using FuturesUnordered for concurrent workflow shutdown is more efficient than the old polling approach

Minor Concerns

  1. Tight select loop (service-manager/src/lib.rs:316-352): The main service loop doesn't have any explicit yield points between iterations. While tokio's select handles this well, consider if there's a need for explicit yielding.

  2. Unbounded watch channel (runtime/src/term_signal.rs:40): The watch channel created for shutdown signaling is fine for this use case, but document that this is intentional for broadcast semantics.

Test Coverage 📊

Concerning Gaps

  1. No tests for TermSignal: The new centralized signal handler has no unit tests
  2. No tests for shutdown timeout behavior: Should test that:
    • Worker shutdown respects the timeout
    • Guard shutdown respects the timeout
    • Multiple SIGTERMs trigger abort
  3. Modified test config (engine/tests/common/ctx.rs): Tests now use requires_graceful_shutdown flags but there are no tests validating this behavior

Recommendations

  • Add unit tests for TermSignal::new() and recv() behavior
  • Add integration tests that send actual signals (or mock them) and verify shutdown timing
  • Test the 3-SIGTERM abort threshold (a rough sketch follows this list)
  • Test configuration parsing for the new Runtime config
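A rough Unix-only sketch of the threshold test; it assumes TermSignal::new() returns a Result, that recv() resolves once per signal and returns true once the force-close threshold (3) is reached, and that the libc crate is available:

#[cfg(unix)]
#[tokio::test]
async fn force_close_after_three_sigterms() {
    let mut term = TermSignal::new().expect("failed to install signal handler");
    let mut force = false;
    for _ in 0..3 {
        // Deliver SIGTERM to our own process.
        unsafe { libc::raise(libc::SIGTERM) };
        force = term.recv().await;
    }
    assert!(force, "third SIGTERM should request force shutdown");
}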

Additional Observations

Documentation

  1. Good: Rustdoc-style /// comments on the Runtime struct fields
  2. Missing: No migration guide for users who might have relied on old shutdown behavior
  3. Missing: No explanation of why guard gets 1 hour vs worker gets 30 seconds (presumably because HTTP requests can be long-lived)

Code Removal

The removal of engine/packages/util/src/signal.rs is good - consolidating this logic makes sense. Confirm that no other code depended on this module.

Pegboard Actor Change (pegboard/src/workflows/actor/mod.rs:533-567)

Moving the "allocate pending actors" logic is a good refactoring - makes the shutdown path cleaner. The comment at line 597 explains the rationale well.

Recommendations

Must Fix Before Merge 🔴

  1. Fix the remaining workflows count bug in gasoline worker shutdown
  2. Fix doc comment consistency in runtime.rs
  3. Fix the clippy warning for underscore binding

Should Fix 🟡

  1. Improve shutdown duration timeout handling to avoid potential tight loops
  2. Add basic unit tests for TermSignal
  3. Propagate or better document the expect() in TermSignal initialization

Nice to Have 🟢

  1. Add config validation for maximum shutdown durations
  2. Document the rationale for different default timeouts
  3. Add integration tests for shutdown behavior
  4. Consider the service abort race condition

Overall Assessment

This is a solid improvement to the shutdown handling infrastructure. The centralization of signal handling and configurable timeouts are excellent architectural decisions. The main concerns are:

  • Testing coverage is inadequate for such critical infrastructure
  • A few minor bugs that should be fixed before merge
  • Some edge cases around timing could be handled more robustly

Recommendation: Request changes for the critical bugs, then approve once those are addressed. The testing gaps are concerning but not blocking if this is being monitored in production.


Review generated with Claude Code

@MasterPtato force-pushed the 11-10-fix_improve_sigterm_handling_for_the_entire_runtime branch from 7aff577 to 9f71238 on November 11, 2025 at 01:44
@MasterPtato force-pushed the 11-07-feat_gas_implement_smart_pulling_for_horizontally_scaled_workers branch from ff49fad to 728f761 on November 11, 2025 at 01:44
@claude (bot) commented Nov 11, 2025

Pull Request Review: Improve SIGTERM Handling for the Entire Runtime

This PR introduces a centralized termination signal handler and improves graceful shutdown across the runtime. Overall, this is a well-structured improvement to signal handling. Below are my observations and recommendations:


Strengths

  1. Centralized Signal Handling: The new TermSignal handler in rivet-runtime provides a singleton pattern for signal management across the entire runtime, avoiding duplicate signal handlers.

  2. Configurable Shutdown Timeouts: The new Runtime config allows operators to customize shutdown durations for different components (worker, guard), which is excellent for production flexibility.

  3. Graceful Degradation: The multi-signal handling (counting up to 3 SIGTERMs before forcing shutdown) provides a good balance between graceful and forced termination.

  4. Cross-platform Support: Proper handling of Unix (SIGTERM/SIGINT) and Windows (Ctrl+C) signals.

  5. Better Logging: Improved structured logging throughout the shutdown process makes debugging much easier.


🐛 Potential Bugs & Issues

Critical Issues

  1. Bug in gasoline/worker.rs:326 - Incorrect variable reference:

    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");

    This should use remaining_workflows (calculated on line 322) instead of self.running_workflows.len() which may be stale. The workflows map wasn't cleaned up before this log.

  2. Comment Inconsistency in config/runtime.rs: Lines 12-16 use // instead of /// for doc comments on public struct fields:

    // to 1 hour.  // ← Should be ///
    guard_shutdown_duration: Option<u32>,
    /// Whether or not to allow running the engine when the previous version that was run is higher than
    // the current version.  // ← Should be ///
    allow_version_rollback: Option<bool>,

Moderate Issues

  1. Potential Race Condition in service-manager/lib.rs: The shutting_down AtomicBool is set (line 333), then services are aborted (lines 336-342), but services check the flag in their loops. There's a window where a service might restart before being aborted. Consider aborting services first, then setting the flag, or use a more explicit coordination mechanism.

  2. FuturesUnordered Not Consumed Properly in gasoline/worker.rs:303: The pattern creates a future that waits for all workflows:

    let join_fut = async { while let Some(_) = wf_futs.next().await {} };

    However, this future is used in a tokio::select! that may break early on timeout or abort. After the select, line 322 tries to count remaining workflows by consuming wf_futs, but if join_fut was selected and completed, wf_futs would be empty. The count would be correct, but the logic is confusing.

  3. Unbounded Retry in Service Manager: Services that crash will retry indefinitely (lines 161-183). While shutdown detection exists, during normal operation a failing service could spam logs and consume resources. Consider a circuit breaker or exponential backoff.


🔒 Security Concerns

  1. Panic in Critical Path (term_signal.rs:88):

    .expect("failed initializing termination signal handler")

    While unlikely, if signal handler initialization fails, the process panics. Consider returning a Result to the caller and handling initialization failures more gracefully at the application level.

  2. No Validation on Shutdown Durations: The worker_shutdown_duration and guard_shutdown_duration accept any u32 value. Consider adding validation or warnings for unreasonable values (e.g., 0 seconds or years).


Performance Considerations

  1. Cloning in service-manager: Each service clones config and pools on every restart (lines 162, 199, 235, 279). While Rust's Arc-based cloning is cheap, ensure these types are indeed cheap to clone.

  2. Busy Loop in Guard Shutdown (guard-core/server.rs:257-274): The shutdown loop continuously polls the term signal and timeout. While not critical, this could be restructured to avoid unnecessary CPU cycles.


🧪 Test Coverage

  1. No Tests for New Code: The new TermSignal handler and shutdown logic lack unit tests. Consider adding:

    • Tests for signal counting and threshold behavior
    • Tests for timeout behavior in worker shutdown
    • Integration tests for graceful shutdown scenarios
    • Tests for the requires_graceful_shutdown flag behavior
  2. No Tests for Configuration: The new Runtime config struct and its accessor methods lack tests for default values and edge cases.


📝 Code Quality & Best Practices

  1. Inconsistent Error Handling: Some places use anyhow::Result with glob imports (anyhow::*), others import explicitly. Per CLAUDE.md: "Do not glob import (::*) from anyhow". The guard-core changes correctly import anyhow::Result explicitly, but ensure consistency across all modified files.

  2. Magic Numbers: FORCE_CLOSE_THRESHOLD = 3 in term_signal.rs:14 could be configurable or at least documented why 3 is the right number.

  3. Unused Variable Pattern (service-manager/lib.rs:324):

    while let Some(_) = handle_futs.next().await {}

    Consider while handle_futs.next().await.is_some() {} for clarity, though both are idiomatic.

  4. Documentation Missing: The new TermSignal::recv() method documents its return value well, but TermSignal::new() and TermSignal::stop() lack doc comments.


💡 Recommendations

  1. Fix the critical bug on gasoline/worker.rs:326
  2. Fix doc comment formatting in config/runtime.rs
  3. Add validation for shutdown duration config values
  4. Add unit tests for TermSignal behavior, especially the force-shutdown threshold
  5. Add integration tests for graceful shutdown scenarios
  6. Consider exponential backoff for service restarts
  7. Document the magic number (3) for force shutdown threshold
  8. Add doc comments to all public APIs

📊 Summary

Overall Assessment: Good improvement to signal handling infrastructure, but needs bug fixes and test coverage before merging.

Blockers:

  • Bug in worker.rs:326 (incorrect variable)
  • Doc comment formatting in runtime.rs

Recommended Before Merge:

  • Add tests for TermSignal
  • Add tests for shutdown behavior
  • Validate shutdown duration configs

Nice to Have:

  • Exponential backoff for service restarts
  • More comprehensive documentation
  • Circuit breaker for failing services

Let me know if you'd like me to elaborate on any of these points or if you have questions about the recommendations!

@claude (bot) commented Nov 11, 2025

PR Review: Improve SIGTERM Handling for the Entire Runtime

Summary

This PR significantly improves graceful shutdown handling across the Rivet runtime. The changes are well-structured and follow good practices for distributed system lifecycle management.

Positive Aspects ✅

1. Excellent Architecture

  • Centralized signal handling: The new TermSignal implementation in rivet-runtime provides a clean, reusable abstraction for signal handling across all services
  • Configurable timeouts: Moving shutdown durations to configuration is a great improvement for operational flexibility
  • Graceful degradation: The force-close threshold (3 SIGTERMs) provides a good escape hatch

2. Well-Implemented Shutdown Logic

  • Worker shutdown (gasoline/worker.rs:270-330): Clean implementation using FuturesUnordered to track workflow completion
  • Guard shutdown (guard-core/server.rs:252-278): Proper handling of pending HTTP requests with configurable timeout
  • Service manager (service-manager/lib.rs:316-353): Smart differentiation between services requiring graceful shutdown vs. immediate abort

3. Code Quality

  • Follows repository conventions (structured logging with ? syntax, lowercase messages)
  • Good use of saturating_sub to prevent underflow in timeout calculations
  • Proper atomic boolean usage for coordinating shutdown state

Issues & Recommendations 🔍

Critical Issues

1. Bug in worker shutdown (gasoline/worker.rs:322-326)

let remaining_workflows = wf_futs.into_iter().count();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(remaining_workflows=?self.running_workflows.len(), "not all workflows evicted");
}

Problem: After the loop, wf_futs has been consumed by wf_futs.next().await, so into_iter().count() always returns 0. Additionally, the warning message uses self.running_workflows.len() instead of remaining_workflows.

Fix: Read the pending count from wf_futs.len() after the shutdown loop exits; len() does not consume the set, and completed futures are removed from it as they finish:

// ... shutdown loop ...
let remaining_workflows = wf_futs.len();
if remaining_workflows == 0 {
    tracing::info!("all workflows evicted");
} else {
    tracing::warn!(?remaining_workflows, "not all workflows evicted");
}

2. Comment inconsistency (config/runtime.rs:11-12, 14-15)
Comments use // instead of /// for doc comments. Should be:

/// Time (in seconds) to allow for guard to wait for pending requests after receiving SIGTERM. Defaults
/// to 1 hour.
guard_shutdown_duration: Option<u32>,
/// Whether or not to allow running the engine when the previous version that was run is higher than
/// the current version.
allow_version_rollback: Option<bool>,

Medium Priority Issues

3. Timeout calculation edge case (gasoline/worker.rs:315)

_ = tokio::time::sleep(shutdown_duration.saturating_sub(shutdown_start.elapsed())) => {

If shutdown_start.elapsed() exceeds shutdown_duration, this creates a zero-duration sleep that will fire immediately on every loop iteration. While functionally correct (will break on next iteration), it's inefficient.

Consider: Break immediately if timeout already exceeded:

loop {
    let elapsed = shutdown_start.elapsed();
    if elapsed >= shutdown_duration {
        tracing::warn!("worker shutdown timed out");
        break;
    }
    tokio::select! {
        // ... existing arms elided ...
        _ = tokio::time::sleep(shutdown_duration - elapsed) => {}
    }
}

4. Missing newline at end of file (service-manager/Cargo.toml)
The file ends without a newline after tracing.workspace = true - this violates POSIX standards.

Minor Issues

5. Inconsistent import style (guard-core/server.rs:10)

use anyhow::Result;

This imports a single type, while the CLAUDE.md emphasizes not to glob import from anyhow. This is good, but consider consistency - other files use anyhow::* patterns that should be avoided per the style guide.

6. Windows signal handling (runtime/term_signal.rs:61-64)
The Windows implementation only handles Ctrl+C, not other termination signals. This is likely fine for the use case, but worth documenting that SIGTERM-equivalent behavior on Windows is limited.
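For reference, a minimal sketch of the Windows-only path via tokio; ctrl_c is the closest analogue to SIGTERM there:

#[cfg(windows)]
async fn wait_for_termination() {
    // Windows delivers CTRL_C_EVENT rather than POSIX signals;
    // tokio exposes it through signal::ctrl_c.
    tokio::signal::ctrl_c()
        .await
        .expect("failed to listen for ctrl-c");
}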

Performance Considerations ⚡

Positive:

  • Using FuturesUnordered for concurrent task management is efficient
  • Atomic bool for shutdown signaling has minimal overhead
  • Graceful shutdown prevents dropped connections and lost work

No concerns about performance impact from these changes.

Security Considerations 🔒

Low Risk: This PR primarily affects shutdown behavior, which has minimal security implications.

Positive: Force-close threshold prevents indefinite hangs that could be exploited in DoS scenarios.

Test Coverage 🧪

Observation: No new tests added for the shutdown behavior. While testing shutdown logic can be complex, consider:

  • Unit tests for TermSignal to verify signal counting and force-close threshold
  • Integration tests that verify graceful shutdown actually waits for running workflows/requests
  • Tests for the timeout behavior

Architecture Considerations 🏗️

Question about pegboard changes (pegboard/workflows/actor/mod.rs:533-547):
The removal of the AllocatePendingActors logic from the actor cleanup path seems unrelated to SIGTERM handling. The comment at line 536-537 mentions "the actor is not allocated" - but this appears to be a behavioral change beyond shutdown improvements. Can you clarify if this is:

  1. A bug fix that was discovered during shutdown testing?
  2. An optimization to move this logic elsewhere?
  3. Related to shutdown behavior in a non-obvious way?

Additional Notes

Great work on:

  • Clean separation of concerns (signal handling, configuration, shutdown logic)
  • Making timeouts configurable for different operational needs
  • Comprehensive handling across all service types (worker, guard, cron, oneshot)

Documentation: Consider adding a section to docs about:

  • The new runtime configuration options
  • Expected shutdown behavior and timeouts
  • What happens when force-close threshold is reached

Recommendation

Approve with minor fixes. Please address the critical bug in worker.rs:322-326 and the comment style in runtime.rs. The other issues are lower priority but worth considering for a follow-up PR.

Great improvement to the runtime's operational reliability! 🎉


🤖 Generated with Claude Code
