fix(pb): dont immediately reallocate actors on runner stopping #3413

MasterPtato · 2025-11-07T22:49:57Z

No description provided.

vercel · 2025-11-07T22:50:00Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
rivetkit-serverless	Ready	Preview	Comment	Nov 10, 2025 10:51pm

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Nov 10, 2025 10:51pm
rivet-inspector	Ignored	Preview	Nov 10, 2025 10:51pm
rivet-site	Ignored	Preview	Nov 10, 2025 10:51pm

MasterPtato · 2025-11-07T22:50:18Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-07T22:51:31Z

PR Review: Fix actor reallocation on runner stopping

Summary

This PR introduces a new GoingAway signal to gracefully handle actors when a runner is stopping, preventing immediate reallocation and allowing for controlled shutdown. This is a significant improvement to the actor lifecycle management.

Positive Changes

Architecture Improvements

✅ New GoingAway signal: Introduces a clear separation between actors being lost (unexpected) and runners going away (planned shutdown). This is a good architectural decision that makes the state transitions more explicit.

✅ Better state tracking: Adding stopping and going_away flags to LifecycleState provides better visibility into the actor's current state transitions.

✅ Cleaner separation of concerns: The refactoring of handle_stopped to use StoppedVariant enum and StoppedResult improves code clarity and makes the different stop scenarios more explicit.

Code Quality

✅ Removed unnecessary kill parameter: The removal of the kill parameter from the destroy workflow and LifecycleResult simplifies the interface and removes a confusing boolean flag.

✅ Better logging: The debug log in handle_stopped now logs the full variant which provides better observability.

✅ Idempotency improvements: Added guards to prevent duplicate signal sends (e.g., checking !state.stopping before sending stop signals).

Issues and Concerns

1. Potential Race Condition in `GoingAway` Handler (High Priority)

Location: actor/mod.rs:471-477

Main::GoingAway(sig) => {
    // Ignore signals for previous generations
    if sig.generation != state.generation {
        return Ok(Loop::Continue);
    }

    if sig.reset_rescheduling {
        state.reschedule_state = Default::default();
    }

    if !state.going_away {
        let Some(runner_workflow_id) = state.runner_workflow_id else {
            return Ok(Loop::Continue);
        };

Issue: The check if !state.going_away happens AFTER checking runner_workflow_id. However, if the actor is not allocated (no runner_workflow_id), we return early but don't set going_away = true. This means the signal could be processed multiple times if the actor gets reallocated.

Recommendation: Set state.going_away = true before the runner_workflow_id check, or handle this case more explicitly.

2. Missing State Transitions (Medium Priority)

Location: actor/mod.rs:333-361

The code sets state.stopping = true when handling ActorIntentStop, but I don't see where state.going_away transitions back to false in all code paths. While it's cleared in handle_stopped:602-604, there might be edge cases where the state could get stuck.

Recommendation: Add explicit state machine documentation or assertions to ensure all transitions are covered.

3. Generation Mismatch Edge Case (Low Priority)

Location: actor/mod.rs:640-654

// We don't know the state of the previous generation of this actor actor if it becomes lost, send stop
// command in case it ended up allocating
if let (StoppedVariant::Lost { .. }, Some(old_runner_workflow_id)) =
    (&variant, old_runner_workflow_id)
{
    ctx.signal(crate::workflows::runner::Command {
        inner: protocol::Command::CommandStopActor(protocol::CommandStopActor {
            actor_id: input.actor_id.to_string(),
            generation: state.generation,
        }),
    })

Issue: The comment says "previous generation" but the code uses state.generation which is the current generation. If we're stopping the old generation, should we use state.generation - 1?

Recommendation: Verify if this is correct or update the comment to match the code's intent.

4. Typo in Comment (Trivial)

Location: actor/mod.rs:640

// We don't know the state of the previous generation of this actor actor if it becomes lost

Issue: Double "actor" in the comment.

Recommendation: Change to "of this actor if it becomes lost".

Performance Considerations

✅ No significant performance concerns: The changes don't introduce expensive operations or blocking calls.

⚠️ State size increase: Adding stopping and going_away booleans increases the serialized state size slightly, but this is negligible.

Security Concerns

✅ No security issues identified: The changes don't introduce new attack vectors or expose sensitive data.

Test Coverage

⚠️ Missing tests: This PR doesn't include tests for:

The new GoingAway signal handling
State transitions between stopping, going_away, and normal states
Edge cases around generation mismatches
Race conditions when receiving multiple signals

Recommendation: Add integration tests that:

Verify actors gracefully stop when runner goes away
Test that actors aren't immediately reallocated during runner shutdown
Verify state transitions are handled correctly
Test generation mismatch scenarios

Code Style

✅ Follows CLAUDE.md conventions:

Uses structured logging with tracing
Lowercase log messages
Uses ? for error formatting in logs
Follows workspace dependency patterns
Uses hard tabs (as specified in rustfmt.toml)

Summary

This is a solid improvement to the actor lifecycle management system. The new GoingAway signal provides better control over actor shutdown during runner maintenance. However, there are a few edge cases and potential race conditions that should be addressed before merging.

Recommendations:

Fix the GoingAway state handling to prevent race conditions
Add comprehensive tests for the new signal and state transitions
Clarify generation handling in the handle_stopped comment/code
Consider documenting the state machine more explicitly (as noted in the TODO comment)

Overall: Approve with suggested changes 👍

🤖 Generated with Claude Code

pkg-pr-new · 2025-11-07T22:52:24Z

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3413

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3413

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3413

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3413

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3413

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3413

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3413

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3413

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3413

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3413

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3413

commit: 87f1c1a

claude · 2025-11-10T22:51:40Z

PR Review: fix(pb): dont immediately reallocate actors on runner stopping

Summary

This PR refactors the actor lifecycle management in Pegboard to introduce a new GoingAway signal that prevents immediate reallocation of actors when a runner is stopping. This is a significant improvement to the shutdown behavior.

Code Quality & Best Practices

Strengths:

✅ Good separation of concerns with the new StoppedVariant enum that distinguishes between normal stops and lost actors
✅ The StoppedResult enum improves code clarity by making return intent explicit
✅ Proper use of structured logging patterns (tracing::debug!(?variant, "actor stopped"))
✅ Comments added to explain gc_timeout_ts purpose in runtime.rs:43-45
✅ Simplification of LifecycleResult by removing the kill field reduces complexity

Concerns:

⚠️ The TODO comment at runtime.rs:20-21 suggests the current state management approach needs redesign. Consider tracking this as a technical debt item
⚠️ The LifecycleState struct now has 3 boolean flags (sleeping, stopping, going_away) that represent state. This could be error-prone - consider using a proper state machine enum in the future
⚠️ The #[serde(default)] attributes on new fields (stopping, going_away) are good for backward compatibility, but ensure migration is tested

Potential Bugs & Issues

Critical:

🔴 Race condition potential: In mod.rs:445-474, the GoingAway signal handler checks if sig.generation != state.generation and returns early, but there's a TOCTOU (time-of-check-time-of-use) gap between checking and using state.runner_workflow_id. Consider holding state more carefully.

Medium Priority:

🟡 In handle_stopped (mod.rs:566-731), the function clears state.going_away and state.stopping at mod.rs:603-604, but if an error occurs during rescheduling, these flags remain cleared even though the operation failed. Consider using a scope guard or only clearing on success.
🟡 The destroy.rs file removes the kill parameter entirely, but the PR doesn't show if there are callers outside this module that might break. Verify all call sites are updated.

Minor:

🟠 In mod.rs:478-482, the Destroy handler sends a stop command but doesn't wait for acknowledgment before returning. Is this intentional? Add a comment explaining the fire-and-forget semantics.

Logic & Correctness

GoingAway Signal Logic:
The new signal is well-designed for the use case:

It sets gc_timeout_ts to allow graceful shutdown
It sends CommandStopActor to the runner
It prevents duplicate stop commands with the going_away flag

However, at mod.rs:415-422, when a runner is stopping, ALL actors get the GoingAway signal. Consider:

What if some actors are already stopping or sleeping? The code doesn't check current state before sending signals.
Should there be batching or rate limiting if there are thousands of actors?

State Machine Clarity:
The refactoring improves state handling, but the interaction between sleeping, stopping, and going_away is not immediately obvious:

Can an actor be both sleeping and going_away?
Can an actor be both stopping and going_away?
Document these invariants or use a proper enum-based state machine

Performance Considerations

✅ Removing the runner_workflow_id from UpdateStateAndDbOutput reduces data transfer
✅ The handle_stopped refactoring consolidates logic paths, reducing code duplication
⚠️ In the runner stopping path (runner.rs:414-422), signals are sent to all actors in a loop. For large runner fleets, this could cause signal queue buildup. Consider using a message broadcast or batching.

Security Concerns

✅ No obvious security issues
✅ The generation checks prevent stale signals from affecting newer actor incarnations
ℹ️ Ensure the CommandStopActor signals are properly authenticated/authorized in the protocol layer (outside scope of this PR)

Test Coverage

Major Concern:

🔴 No test files found in the pegboard package. This is a significant refactoring of critical lifecycle management code without apparent test coverage.
The PR changes complex state machine logic that handles actor lifecycle transitions
Recommend adding integration tests that cover:
- Runner stopping with active actors
- Race conditions between GoingAway and actor state changes
- Generation mismatch scenarios
- The interaction between stopping, going_away, and sleeping states

Specific Code Issues

`mod.rs:637-640` - Unclear Logic

// We don't know the state of the previous generation of this actor actor if it becomes lost, send stop
// command in case it ended up allocating
if let (StoppedVariant::Lost { .. }, Some(old_runner_workflow_id)) =
    (&variant, old_runner_workflow_id)

Typo: "actor actor" → "actor"
This sends a stop command for the OLD generation. Is this safe if the old runner has already deallocated? Add a comment explaining why this is necessary.

`mod.rs:680-686` - Inconsistent Error Handling

match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => {
        match runtime::reschedule_actor(ctx, &input, state, false).await? {
            // ...
            runtime::SpawnActorOutput::Sleep | runtime::SpawnActorOutput::Destroy => {
                return Ok(StoppedResult::Destroy);
            }
        }
    }

The comment says "Its not possible for SpawnActorOutput::Sleep to be returned here" but the code handles it anyway. Either remove the arm or update the comment.

`runtime.rs:32-34` - State Flag Proliferation

#[serde(default)]
pub stopping: bool,
#[serde(default)]
pub going_away: bool,

These flags indicate the state management is becoming complex. The TODO at line 20 acknowledges this. Consider prioritizing the refactor to use proper state enums.

Recommendations

Add Tests: Critical - add integration tests for the new GoingAway signal and state transitions
State Machine Refactor: Plan the mentioned refactor to use nested enums for state clarity
Documentation: Add module-level documentation explaining the actor lifecycle states and transitions
Fix Typo: Line 637 "actor actor"
Clarify Comment: Line 681-682 about unreachable code
Consider Batching: For runner shutdown with many actors, consider batch signal sending

Verdict

This is a good refactoring that improves the shutdown behavior by preventing immediate reallocation. The code quality is generally high and follows project conventions. However, the lack of test coverage for such critical logic is concerning, and the state management approach is acknowledged (via TODO) to need improvement.

Recommendation: Request tests before merging, but the core logic looks sound.

🤖 Generated with Claude Code

claude · 2025-11-10T22:51:46Z

Code Review: Fix Pegboard Actor Immediate Reallocation on Runner Stopping

Overview

This PR addresses an issue where actors were being immediately reallocated when their runner was stopping, which is undesirable behavior. The solution introduces a new GoingAway signal and state management to handle graceful shutdown scenarios separately from lost/failed actor scenarios.

Positive Changes ✅

1. Improved State Management
The addition of stopping and going_away boolean flags to LifecycleState (runtime.rs:32-34) provides clearer state tracking for different shutdown scenarios. This is a good architectural improvement.

2. Separation of Concerns
The new GoingAway signal distinguishes between:

Unexpected actor loss (requiring immediate reallocation)
Expected shutdown due to runner draining (no immediate reallocation needed)

This is conceptually sound and addresses the core issue.

3. Code Cleanup
Removing the kill parameter from the destroy workflow simplifies the API. The decision to send stop commands is now handled at the callsite, which is clearer.

4. Better Error Handling
The refactoring of handle_stopped() with StoppedVariant enum (mod.rs:569-573) improves type safety and makes the different stopping scenarios more explicit.

Concerns & Issues 🔍

1. State Management Complexity ⚠️
The addition of stopping and going_away flags increases complexity. Both flags are cleared together in handle_stopped() (mod.rs:602-603), which suggests they might be modeling the same concept differently. Consider:

Are there scenarios where stopping is true but going_away is false?
Should these be combined into a single enum state?
The TODO comment at runtime.rs:20-21 acknowledges this complexity issue

2. Idempotency Check Missing ⚠️
In mod.rs:480-486, the GoingAway signal handler checks !state.going_away before processing, which is good. However, the Lost signal handler (mod.rs:447-470) doesn't have a similar idempotency check. This could lead to duplicate processing if multiple Lost signals are received.

3. Race Condition Potential ⚠️
In the GoingAway handler (mod.rs:480-507), there's a TOCTOU (time-of-check-time-of-use) pattern:

if !state.going_away {
    let Some(runner_workflow_id) = state.runner_workflow_id else {
        return Ok(Loop::Continue);
    };
    // ... state modifications ...
}

If runner_workflow_id becomes None between the check and the signal send, this could cause issues. Consider restructuring to capture both checks atomically.

4. reset_rescheduling Parameter Removed ⚠️
The reset_rescheduling parameter was removed from reschedule_actor() (runtime.rs:629) but it's still accepted in the GoingAway and Lost signals. However, in Lost signal handling (mod.rs:456), reset_rescheduling is checked before handle_stopped() but never passed through. This creates inconsistent behavior:

GoingAway can reset rescheduling (mod.rs:456)
Lost checks reset_rescheduling but the reset happens before determining if rescheduling is needed

5. Documentation Gaps 📝

The GoingAway signal (mod.rs:780-785) lacks documentation explaining when it should be used vs Lost
The distinction between stopping, going_away, and sleeping states needs clearer documentation
No doc comments explaining the lifecycle state machine transitions

6. Force Stop on Destroy ⚠️
In the Destroy signal handler (mod.rs:508-521), a stop command is sent even if the actor might already be stopped. While this might be intentional for cleanup, consider:

Could this create unnecessary work or log noise?
Should there be a check for current actor state?

7. Formatting Inconsistencies 🎨
Some formatting changes appear inconsistent with surrounding code:

mod.rs:293-296: Breaks let-else statement across lines
mod.rs:306-311: Nested formatting for config().pegboard()

While these follow the hard tab convention, they differ from the original style. Consider consistency with the existing codebase style.

8. Test Coverage ❗
No test files were found for the pegboard workflows. Given the complexity of state management and the critical nature of actor lifecycle management, this code would greatly benefit from:

Unit tests for handle_stopped() with different StoppedVariant inputs
Integration tests for the GoingAway vs Lost signal handling paths
State transition tests to verify the state machine correctness

Potential Bugs 🐛

1. Duplicate Stop Commands
In handle_stopped() (mod.rs:640-653), when an actor is lost, a stop command is sent to the old runner. But if GoingAway already sent a stop command, this could result in duplicate stop commands for the same actor/generation. While likely harmless, it's inefficient.

2. Unconditional reset_rescheduling in Lost Signal
At mod.rs:456, reset_rescheduling resets the state before calling handle_stopped(). This means even if force_reschedule is false and the actor is sleeping, the reschedule count gets reset. Is this intended?

Performance Considerations ⚡

The changes don't introduce obvious performance issues
Signal sending is already async, so the additional GoingAway signal shouldn't cause problems
State field additions are minimal (3 booleans)

Security Concerns 🔒

No security issues identified. The changes are internal to the actor lifecycle management system.

Recommendations 📋

Add comprehensive tests - This is critical given the complexity of state management
Document the state machine - Add clear documentation explaining all possible states and transitions
Refactor state flags - Consider consolidating stopping, going_away, and sleeping into a proper state enum as mentioned in the TODO
Add idempotency checks - Ensure all signal handlers protect against duplicate signal delivery
Fix reset_rescheduling semantics - Make the behavior consistent between Lost and GoingAway signals
Add structured logging - Include state transitions in logs to help with debugging: tracing::info!(?old_state, ?new_state, "actor state transition")

Architecture Suggestion 💡

Consider modeling the actor state more explicitly:

enum ActorRunState {
    Allocated { runner_id: Id, runner_workflow_id: Id },
    Sleeping { will_wake: bool },
    Stopping { runner_id: Id, runner_workflow_id: Id },
    GoingAway { runner_id: Id, runner_workflow_id: Id },
    Rescheduling,
}

This would make invalid states unrepresentable and clarify the state machine.

Verdict

This PR addresses a real issue and the approach is sound. However, the increasing complexity of boolean state flags is concerning and aligns with the existing TODO comment. The changes work but would benefit from the refactoring mentioned in runtime.rs:20-21 to make the state machine more transparent.

Recommendation: Approve with suggestions for follow-up work on state machine refactoring and test coverage.

MasterPtato requested a review from NathanFlurry November 7, 2025 22:49

MasterPtato mentioned this pull request Nov 7, 2025

feat(gateway): websocket hibernation #3400

Open

MasterPtato mentioned this pull request Nov 7, 2025

chore(gas): rename wake sub to bump sub #3408

Open

vercel bot deployed to Preview – rivetkit-serverless November 7, 2025 22:51 View deployment

MasterPtato mentioned this pull request Nov 8, 2025

feat(gas): implement smart pulling for horizontally scaled workers #3417

Open

NathanFlurry approved these changes Nov 8, 2025

View reviewed changes

fix(pb): dont immediately reallocate actors on runner stopping

87f1c1a

MasterPtato force-pushed the 11-07-fix_pb_dont_immediately_reallocate_actors_on_runner_stopping branch from 79bbca9 to 87f1c1a Compare November 10, 2025 22:49

MasterPtato force-pushed the 11-07-chore_gas_rename_wake_sub_to_bump_sub branch from c0c9897 to 44cfde4 Compare November 10, 2025 22:50

MasterPtato requested a review from NathanFlurry November 10, 2025 22:50

MasterPtato mentioned this pull request Nov 10, 2025

fix: improve sigterm handling for the entire runtime #3443

Open

vercel bot deployed to Preview – rivetkit-serverless November 10, 2025 22:51 View deployment

This was referenced Nov 11, 2025

fix(guard): handle websocket tasks during shutdown #3448

Open

fix(guard): dont send open msg for hibernating ws, hibernating ws keepalive #3449

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(pb): dont immediately reallocate actors on runner stopping #3413

fix(pb): dont immediately reallocate actors on runner stopping #3413

MasterPtato commented Nov 7, 2025

Uh oh!

vercel bot commented Nov 7, 2025 •

edited

Loading

Uh oh!

MasterPtato commented Nov 7, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 7, 2025

Uh oh!

pkg-pr-new bot commented Nov 7, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 10, 2025

Uh oh!

claude bot commented Nov 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix(pb): dont immediately reallocate actors on runner stopping #3413

Are you sure you want to change the base?

fix(pb): dont immediately reallocate actors on runner stopping #3413

Conversation

MasterPtato commented Nov 7, 2025

Uh oh!

vercel bot commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MasterPtato commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Nov 7, 2025

PR Review: Fix actor reallocation on runner stopping

Summary

Positive Changes

Architecture Improvements

Code Quality

Issues and Concerns

1. Potential Race Condition in GoingAway Handler (High Priority)

2. Missing State Transitions (Medium Priority)

3. Generation Mismatch Edge Case (Low Priority)

4. Typo in Comment (Trivial)

Performance Considerations

Security Concerns

Test Coverage

Code Style

Summary

Recommendations:

Uh oh!

pkg-pr-new bot commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Nov 10, 2025

PR Review: fix(pb): dont immediately reallocate actors on runner stopping

Summary

Code Quality & Best Practices

Potential Bugs & Issues

Logic & Correctness

Performance Considerations

Security Concerns

Test Coverage

Specific Code Issues

mod.rs:637-640 - Unclear Logic

mod.rs:680-686 - Inconsistent Error Handling

runtime.rs:32-34 - State Flag Proliferation

Recommendations

Verdict

Uh oh!

claude bot commented Nov 10, 2025

Code Review: Fix Pegboard Actor Immediate Reallocation on Runner Stopping

Overview

Positive Changes ✅

Concerns & Issues 🔍

Potential Bugs 🐛

Performance Considerations ⚡

Security Concerns 🔒

Recommendations 📋

Architecture Suggestion 💡

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vercel bot commented Nov 7, 2025 •

edited

Loading

MasterPtato commented Nov 7, 2025 •

edited

Loading

1. Potential Race Condition in `GoingAway` Handler (High Priority)

pkg-pr-new bot commented Nov 7, 2025 •

edited

Loading

`mod.rs:637-640` - Unclear Logic

`mod.rs:680-686` - Inconsistent Error Handling

`runtime.rs:32-34` - State Flag Proliferation