Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@vercel
Copy link

vercel bot commented Nov 7, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 10, 2025 10:51pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 10, 2025 10:51pm
rivet-inspector Ignored Ignored Preview Nov 10, 2025 10:51pm
rivet-site Ignored Ignored Preview Nov 10, 2025 10:51pm

Copy link
Contributor Author

MasterPtato commented Nov 7, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 7, 2025

PR Review: Fix actor reallocation on runner stopping

Summary

This PR introduces a new GoingAway signal to gracefully handle actors when a runner is stopping, preventing immediate reallocation and allowing for controlled shutdown. This is a significant improvement to the actor lifecycle management.

Positive Changes

Architecture Improvements

New GoingAway signal: Introduces a clear separation between actors being lost (unexpected) and runners going away (planned shutdown). This is a good architectural decision that makes the state transitions more explicit.

Better state tracking: Adding stopping and going_away flags to LifecycleState provides better visibility into the actor's current state transitions.

Cleaner separation of concerns: The refactoring of handle_stopped to use StoppedVariant enum and StoppedResult improves code clarity and makes the different stop scenarios more explicit.

Code Quality

Removed unnecessary kill parameter: The removal of the kill parameter from the destroy workflow and LifecycleResult simplifies the interface and removes a confusing boolean flag.

Better logging: The debug log in handle_stopped now logs the full variant which provides better observability.

Idempotency improvements: Added guards to prevent duplicate signal sends (e.g., checking !state.stopping before sending stop signals).

Issues and Concerns

1. Potential Race Condition in GoingAway Handler (High Priority)

Location: actor/mod.rs:471-477

Main::GoingAway(sig) => {
    // Ignore signals for previous generations
    if sig.generation != state.generation {
        return Ok(Loop::Continue);
    }

    if sig.reset_rescheduling {
        state.reschedule_state = Default::default();
    }

    if !state.going_away {
        let Some(runner_workflow_id) = state.runner_workflow_id else {
            return Ok(Loop::Continue);
        };

Issue: The check if !state.going_away happens AFTER checking runner_workflow_id. However, if the actor is not allocated (no runner_workflow_id), we return early but don't set going_away = true. This means the signal could be processed multiple times if the actor gets reallocated.

Recommendation: Set state.going_away = true before the runner_workflow_id check, or handle this case more explicitly.

2. Missing State Transitions (Medium Priority)

Location: actor/mod.rs:333-361

The code sets state.stopping = true when handling ActorIntentStop, but I don't see where state.going_away transitions back to false in all code paths. While it's cleared in handle_stopped:602-604, there might be edge cases where the state could get stuck.

Recommendation: Add explicit state machine documentation or assertions to ensure all transitions are covered.

3. Generation Mismatch Edge Case (Low Priority)

Location: actor/mod.rs:640-654

// We don't know the state of the previous generation of this actor actor if it becomes lost, send stop
// command in case it ended up allocating
if let (StoppedVariant::Lost { .. }, Some(old_runner_workflow_id)) =
    (&variant, old_runner_workflow_id)
{
    ctx.signal(crate::workflows::runner::Command {
        inner: protocol::Command::CommandStopActor(protocol::CommandStopActor {
            actor_id: input.actor_id.to_string(),
            generation: state.generation,
        }),
    })

Issue: The comment says "previous generation" but the code uses state.generation which is the current generation. If we're stopping the old generation, should we use state.generation - 1?

Recommendation: Verify if this is correct or update the comment to match the code's intent.

4. Typo in Comment (Trivial)

Location: actor/mod.rs:640

// We don't know the state of the previous generation of this actor actor if it becomes lost

Issue: Double "actor" in the comment.

Recommendation: Change to "of this actor if it becomes lost".

Performance Considerations

No significant performance concerns: The changes don't introduce expensive operations or blocking calls.

⚠️ State size increase: Adding stopping and going_away booleans increases the serialized state size slightly, but this is negligible.

Security Concerns

No security issues identified: The changes don't introduce new attack vectors or expose sensitive data.

Test Coverage

⚠️ Missing tests: This PR doesn't include tests for:

  • The new GoingAway signal handling
  • State transitions between stopping, going_away, and normal states
  • Edge cases around generation mismatches
  • Race conditions when receiving multiple signals

Recommendation: Add integration tests that:

  1. Verify actors gracefully stop when runner goes away
  2. Test that actors aren't immediately reallocated during runner shutdown
  3. Verify state transitions are handled correctly
  4. Test generation mismatch scenarios

Code Style

Follows CLAUDE.md conventions:

  • Uses structured logging with tracing
  • Lowercase log messages
  • Uses ? for error formatting in logs
  • Follows workspace dependency patterns
  • Uses hard tabs (as specified in rustfmt.toml)

Summary

This is a solid improvement to the actor lifecycle management system. The new GoingAway signal provides better control over actor shutdown during runner maintenance. However, there are a few edge cases and potential race conditions that should be addressed before merging.

Recommendations:

  1. Fix the GoingAway state handling to prevent race conditions
  2. Add comprehensive tests for the new signal and state transitions
  3. Clarify generation handling in the handle_stopped comment/code
  4. Consider documenting the state machine more explicitly (as noted in the TODO comment)

Overall: Approve with suggested changes 👍


🤖 Generated with Claude Code

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 7, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3413

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3413

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3413

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3413

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3413

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3413

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3413

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3413

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3413

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3413

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3413

commit: 87f1c1a

@MasterPtato MasterPtato force-pushed the 11-07-fix_pb_dont_immediately_reallocate_actors_on_runner_stopping branch from 79bbca9 to 87f1c1a Compare November 10, 2025 22:49
@MasterPtato MasterPtato force-pushed the 11-07-chore_gas_rename_wake_sub_to_bump_sub branch from c0c9897 to 44cfde4 Compare November 10, 2025 22:50
@claude
Copy link

claude bot commented Nov 10, 2025

PR Review: fix(pb): dont immediately reallocate actors on runner stopping

Summary

This PR refactors the actor lifecycle management in Pegboard to introduce a new GoingAway signal that prevents immediate reallocation of actors when a runner is stopping. This is a significant improvement to the shutdown behavior.


Code Quality & Best Practices

Strengths:

  • ✅ Good separation of concerns with the new StoppedVariant enum that distinguishes between normal stops and lost actors
  • ✅ The StoppedResult enum improves code clarity by making return intent explicit
  • ✅ Proper use of structured logging patterns (tracing::debug!(?variant, "actor stopped"))
  • ✅ Comments added to explain gc_timeout_ts purpose in runtime.rs:43-45
  • ✅ Simplification of LifecycleResult by removing the kill field reduces complexity

Concerns:

  • ⚠️ The TODO comment at runtime.rs:20-21 suggests the current state management approach needs redesign. Consider tracking this as a technical debt item
  • ⚠️ The LifecycleState struct now has 3 boolean flags (sleeping, stopping, going_away) that represent state. This could be error-prone - consider using a proper state machine enum in the future
  • ⚠️ The #[serde(default)] attributes on new fields (stopping, going_away) are good for backward compatibility, but ensure migration is tested

Potential Bugs & Issues

Critical:

  • 🔴 Race condition potential: In mod.rs:445-474, the GoingAway signal handler checks if sig.generation != state.generation and returns early, but there's a TOCTOU (time-of-check-time-of-use) gap between checking and using state.runner_workflow_id. Consider holding state more carefully.

Medium Priority:

  • 🟡 In handle_stopped (mod.rs:566-731), the function clears state.going_away and state.stopping at mod.rs:603-604, but if an error occurs during rescheduling, these flags remain cleared even though the operation failed. Consider using a scope guard or only clearing on success.
  • 🟡 The destroy.rs file removes the kill parameter entirely, but the PR doesn't show if there are callers outside this module that might break. Verify all call sites are updated.

Minor:

  • 🟠 In mod.rs:478-482, the Destroy handler sends a stop command but doesn't wait for acknowledgment before returning. Is this intentional? Add a comment explaining the fire-and-forget semantics.

Logic & Correctness

GoingAway Signal Logic:
The new signal is well-designed for the use case:

  • It sets gc_timeout_ts to allow graceful shutdown
  • It sends CommandStopActor to the runner
  • It prevents duplicate stop commands with the going_away flag

However, at mod.rs:415-422, when a runner is stopping, ALL actors get the GoingAway signal. Consider:

  • What if some actors are already stopping or sleeping? The code doesn't check current state before sending signals.
  • Should there be batching or rate limiting if there are thousands of actors?

State Machine Clarity:
The refactoring improves state handling, but the interaction between sleeping, stopping, and going_away is not immediately obvious:

  • Can an actor be both sleeping and going_away?
  • Can an actor be both stopping and going_away?
  • Document these invariants or use a proper enum-based state machine

Performance Considerations

  • ✅ Removing the runner_workflow_id from UpdateStateAndDbOutput reduces data transfer
  • ✅ The handle_stopped refactoring consolidates logic paths, reducing code duplication
  • ⚠️ In the runner stopping path (runner.rs:414-422), signals are sent to all actors in a loop. For large runner fleets, this could cause signal queue buildup. Consider using a message broadcast or batching.

Security Concerns

  • ✅ No obvious security issues
  • ✅ The generation checks prevent stale signals from affecting newer actor incarnations
  • ℹ️ Ensure the CommandStopActor signals are properly authenticated/authorized in the protocol layer (outside scope of this PR)

Test Coverage

Major Concern:

  • 🔴 No test files found in the pegboard package. This is a significant refactoring of critical lifecycle management code without apparent test coverage.
  • The PR changes complex state machine logic that handles actor lifecycle transitions
  • Recommend adding integration tests that cover:
    • Runner stopping with active actors
    • Race conditions between GoingAway and actor state changes
    • Generation mismatch scenarios
    • The interaction between stopping, going_away, and sleeping states

Specific Code Issues

mod.rs:637-640 - Unclear Logic

// We don't know the state of the previous generation of this actor actor if it becomes lost, send stop
// command in case it ended up allocating
if let (StoppedVariant::Lost { .. }, Some(old_runner_workflow_id)) =
    (&variant, old_runner_workflow_id)
  • Typo: "actor actor" → "actor"
  • This sends a stop command for the OLD generation. Is this safe if the old runner has already deallocated? Add a comment explaining why this is necessary.

mod.rs:680-686 - Inconsistent Error Handling

match (input.crash_policy, failed) {
    (CrashPolicy::Restart, true) => {
        match runtime::reschedule_actor(ctx, &input, state, false).await? {
            // ...
            runtime::SpawnActorOutput::Sleep | runtime::SpawnActorOutput::Destroy => {
                return Ok(StoppedResult::Destroy);
            }
        }
    }

The comment says "Its not possible for SpawnActorOutput::Sleep to be returned here" but the code handles it anyway. Either remove the arm or update the comment.

runtime.rs:32-34 - State Flag Proliferation

#[serde(default)]
pub stopping: bool,
#[serde(default)]
pub going_away: bool,

These flags indicate the state management is becoming complex. The TODO at line 20 acknowledges this. Consider prioritizing the refactor to use proper state enums.


Recommendations

  1. Add Tests: Critical - add integration tests for the new GoingAway signal and state transitions
  2. State Machine Refactor: Plan the mentioned refactor to use nested enums for state clarity
  3. Documentation: Add module-level documentation explaining the actor lifecycle states and transitions
  4. Fix Typo: Line 637 "actor actor"
  5. Clarify Comment: Line 681-682 about unreachable code
  6. Consider Batching: For runner shutdown with many actors, consider batch signal sending

Verdict

This is a good refactoring that improves the shutdown behavior by preventing immediate reallocation. The code quality is generally high and follows project conventions. However, the lack of test coverage for such critical logic is concerning, and the state management approach is acknowledged (via TODO) to need improvement.

Recommendation: Request tests before merging, but the core logic looks sound.


🤖 Generated with Claude Code

@claude
Copy link

claude bot commented Nov 10, 2025

Code Review: Fix Pegboard Actor Immediate Reallocation on Runner Stopping

Overview

This PR addresses an issue where actors were being immediately reallocated when their runner was stopping, which is undesirable behavior. The solution introduces a new GoingAway signal and state management to handle graceful shutdown scenarios separately from lost/failed actor scenarios.

Positive Changes ✅

1. Improved State Management
The addition of stopping and going_away boolean flags to LifecycleState (runtime.rs:32-34) provides clearer state tracking for different shutdown scenarios. This is a good architectural improvement.

2. Separation of Concerns
The new GoingAway signal distinguishes between:

  • Unexpected actor loss (requiring immediate reallocation)
  • Expected shutdown due to runner draining (no immediate reallocation needed)

This is conceptually sound and addresses the core issue.

3. Code Cleanup
Removing the kill parameter from the destroy workflow simplifies the API. The decision to send stop commands is now handled at the callsite, which is clearer.

4. Better Error Handling
The refactoring of handle_stopped() with StoppedVariant enum (mod.rs:569-573) improves type safety and makes the different stopping scenarios more explicit.

Concerns & Issues 🔍

1. State Management Complexity ⚠️
The addition of stopping and going_away flags increases complexity. Both flags are cleared together in handle_stopped() (mod.rs:602-603), which suggests they might be modeling the same concept differently. Consider:

  • Are there scenarios where stopping is true but going_away is false?
  • Should these be combined into a single enum state?
  • The TODO comment at runtime.rs:20-21 acknowledges this complexity issue

2. Idempotency Check Missing ⚠️
In mod.rs:480-486, the GoingAway signal handler checks !state.going_away before processing, which is good. However, the Lost signal handler (mod.rs:447-470) doesn't have a similar idempotency check. This could lead to duplicate processing if multiple Lost signals are received.

3. Race Condition Potential ⚠️
In the GoingAway handler (mod.rs:480-507), there's a TOCTOU (time-of-check-time-of-use) pattern:

if !state.going_away {
    let Some(runner_workflow_id) = state.runner_workflow_id else {
        return Ok(Loop::Continue);
    };
    // ... state modifications ...
}

If runner_workflow_id becomes None between the check and the signal send, this could cause issues. Consider restructuring to capture both checks atomically.

4. reset_rescheduling Parameter Removed ⚠️
The reset_rescheduling parameter was removed from reschedule_actor() (runtime.rs:629) but it's still accepted in the GoingAway and Lost signals. However, in Lost signal handling (mod.rs:456), reset_rescheduling is checked before handle_stopped() but never passed through. This creates inconsistent behavior:

  • GoingAway can reset rescheduling (mod.rs:456)
  • Lost checks reset_rescheduling but the reset happens before determining if rescheduling is needed

5. Documentation Gaps 📝

  • The GoingAway signal (mod.rs:780-785) lacks documentation explaining when it should be used vs Lost
  • The distinction between stopping, going_away, and sleeping states needs clearer documentation
  • No doc comments explaining the lifecycle state machine transitions

6. Force Stop on Destroy ⚠️
In the Destroy signal handler (mod.rs:508-521), a stop command is sent even if the actor might already be stopped. While this might be intentional for cleanup, consider:

  • Could this create unnecessary work or log noise?
  • Should there be a check for current actor state?

7. Formatting Inconsistencies 🎨
Some formatting changes appear inconsistent with surrounding code:

  • mod.rs:293-296: Breaks let-else statement across lines
  • mod.rs:306-311: Nested formatting for config().pegboard()

While these follow the hard tab convention, they differ from the original style. Consider consistency with the existing codebase style.

8. Test Coverage
No test files were found for the pegboard workflows. Given the complexity of state management and the critical nature of actor lifecycle management, this code would greatly benefit from:

  • Unit tests for handle_stopped() with different StoppedVariant inputs
  • Integration tests for the GoingAway vs Lost signal handling paths
  • State transition tests to verify the state machine correctness

Potential Bugs 🐛

1. Duplicate Stop Commands
In handle_stopped() (mod.rs:640-653), when an actor is lost, a stop command is sent to the old runner. But if GoingAway already sent a stop command, this could result in duplicate stop commands for the same actor/generation. While likely harmless, it's inefficient.

2. Unconditional reset_rescheduling in Lost Signal
At mod.rs:456, reset_rescheduling resets the state before calling handle_stopped(). This means even if force_reschedule is false and the actor is sleeping, the reschedule count gets reset. Is this intended?

Performance Considerations ⚡

  • The changes don't introduce obvious performance issues
  • Signal sending is already async, so the additional GoingAway signal shouldn't cause problems
  • State field additions are minimal (3 booleans)

Security Concerns 🔒

No security issues identified. The changes are internal to the actor lifecycle management system.

Recommendations 📋

  1. Add comprehensive tests - This is critical given the complexity of state management
  2. Document the state machine - Add clear documentation explaining all possible states and transitions
  3. Refactor state flags - Consider consolidating stopping, going_away, and sleeping into a proper state enum as mentioned in the TODO
  4. Add idempotency checks - Ensure all signal handlers protect against duplicate signal delivery
  5. Fix reset_rescheduling semantics - Make the behavior consistent between Lost and GoingAway signals
  6. Add structured logging - Include state transitions in logs to help with debugging: tracing::info!(?old_state, ?new_state, "actor state transition")

Architecture Suggestion 💡

Consider modeling the actor state more explicitly:

enum ActorRunState {
    Allocated { runner_id: Id, runner_workflow_id: Id },
    Sleeping { will_wake: bool },
    Stopping { runner_id: Id, runner_workflow_id: Id },
    GoingAway { runner_id: Id, runner_workflow_id: Id },
    Rescheduling,
}

This would make invalid states unrepresentable and clarify the state machine.

Verdict

This PR addresses a real issue and the approach is sound. However, the increasing complexity of boolean state flags is concerning and aligns with the existing TODO comment. The changes work but would benefit from the refactoring mentioned in runtime.rs:20-21 to make the state machine more transparent.

Recommendation: Approve with suggestions for follow-up work on state machine refactoring and test coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants