Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 13, 2025

Deployment failed with the following error:

Too many requests - try again in 3 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

Copy link
Member Author

NathanFlurry commented Nov 13, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 13, 2025

PR Review: Cloudflare Workers Actor State Management Fix

This PR addresses critical issues with actor lifecycle management in Cloudflare Workers by refactoring how actor state is tracked and managed. Overall, this is a well-structured fix that addresses important memory management and state synchronization issues.

Summary of Changes

The PR refactors actor state management by:

  1. Moving from Map to WeakMap for actor state storage to enable proper garbage collection
  2. Separating state concerns into InitializedData, LoadedActor, and a consolidated ActorGlobalState
  3. Fixing async destroy operations by properly awaiting cleanup operations
  4. Improving generation handling to prevent stale actor instances after resurrection
  5. Unifying RPC response types with explicit success/error discriminated unions

Code Quality & Best Practices

Strengths

  1. Memory Management Improvement: The switch from Map<string, ActorHandler> to WeakMap<DurableObjectState, ActorGlobalState> is excellent. This allows the garbage collector to clean up actor state when DOs are evicted, preventing memory leaks.

  2. Strong Type Safety: The introduction of discriminated unions (ActorInitResponse) for RPC responses is a clean pattern that forces proper error handling at call sites.

  3. Generation Safety: The new invariant check at actor-handler-do.ts:158-163 catches stale cached actors with mismatched generations, which is crucial for correctness after actor resurrection.

  4. Explicit State Reset: The reset() method on ActorGlobalState provides a clear contract for cleanup.

⚠️ Concerns & Potential Issues

1. Race Condition in startDestroy (Critical)

Location: actor-driver.ts:264-285

The startDestroy method spawns an async cleanup operation without tracking or awaiting it:

startDestroy(actorId: string): void {
    // ... checks ...
    handler.destroying = true;
    
    // Spawn onStop
    this.#callOnStopAsync(actorId, doId, handler.actorInstance);
}

Issues:

  • If a new request comes in before #callOnStopAsync completes, the actor might be in an inconsistent state
  • The destroying flag is set immediately, but cleanup happens asynchronously
  • No way to wait for destruction to complete if needed

Recommendation:

// Consider returning a promise or tracking pending destroys
private pendingDestroys = new Map<string, Promise<void>>();

startDestroy(actorId: string): void {
    // ... existing checks ...
    const destroyPromise = this.#callOnStopAsync(actorId, doId, handler.actorInstance);
    this.pendingDestroys.set(doId, destroyPromise);
    destroyPromise.finally(() => this.pendingDestroys.delete(doId));
}

// Then in loadActor, check:
const pendingDestroy = this.pendingDestroys.get(doId);
if (pendingDestroy) {
    await pendingDestroy;
}

2. Inconsistent State Initialization Pattern

Location: actor-handler-do.ts:107-112 vs actor-handler-do.ts:366-370

The constructor initializes #state from the WeakMap, but create() also has fallback initialization logic:

// Constructor (line 107-112)
this.#state = globalState.getActorState(this.ctx);
if (!this.#state) {
    this.#state = new ActorGlobalState();
    globalState.setActorState(this.ctx, this.#state);
}

// create() method (line 366-370)
if (!this.#state) {
    this.#state = new ActorGlobalState();
    globalState.setActorState(this.ctx, this.#state);
}

Issue: This duplication suggests uncertainty about when state might not exist. If the constructor always runs first, the check in create() should be an invariant.

Recommendation:

// In create(), replace with:
invariant(this.#state, "State should be initialized in constructor");

3. Potential Memory Leak with Strong References

Location: actor-handler-do.ts:72-77

The comment mentions that ActorHandler holds a strong reference while the global state holds a weak reference:

/**
 * This holds a strong reference to ActorGlobalState.
 * CloudflareDurableObjectGlobalState holds a weak reference so we can
 * access it elsewhere.
 **/
#state?: ActorGlobalState;

Issue: The ActorGlobalState contains references to actorRouter, actorDriver, and actorInstance which may hold significant resources. If a DO instance stays alive but idle for a long time, these resources won't be released.

Question: Is there a lifecycle hook when the DO becomes idle where you could clear #state.actor but keep #state.initialized?

4. Missing Error Handling in Async Cleanup

Location: actor-driver.ts:287-313

The #callOnStopAsync method has no error handling:

async #callOnStopAsync(
    actorId: string,
    doId: string,
    actor: CoreAnyActorInstance,
) {
    // Stop
    await actor.onStop("destroy");
    // ... cleanup operations ...
}

Issue: If onStop throws, the cleanup operations (SQL deletes, alarm deletion, KV deletion) won't execute, leaving orphaned state.

Recommendation:

async #callOnStopAsync(
    actorId: string,
    doId: string,
    actor: CoreAnyActorInstance,
) {
    try {
        await actor.onStop("destroy");
    } catch (error) {
        logger().error({ 
            msg: "error in actor onStop", 
            actorId, 
            error: stringifyError(error) 
        });
        // Continue with cleanup even if onStop fails
    }

    // ... cleanup operations ...
}

5. Stale Actor Clearing Could Be More Robust

Location: actor-handler-do.ts:328-333

When resurrecting a destroyed actor, the code clears #state.actor but not other fields:

if (this.#state) {
    this.#state.actor = undefined;
}

Recommendation: Use the reset() method for consistency:

if (this.#state) {
    // Keep initialized data, reset actor instance
    const initializedData = this.#state.initialized;
    this.#state.reset();
    this.#state.initialized = initializedData;
}

6. Type Safety: ActorKey Type Inconsistency

Location: actor-driver.ts:169

const key = JSON.parse(result.value[1] as string) as string[];

But ActorKey is imported and used elsewhere. The cast to string[] should be ActorKey for consistency.


Performance Considerations

  1. ✅ Good: The WeakMap approach is excellent for memory management
  2. ⚠️ Concern: The loadActor method at actor-driver.ts:132-203 performs SQL queries on every call when the actor isn't cached. Consider adding debug logging to track cache hit/miss rates.

Security Concerns

No critical security issues identified, but note:

  1. The generation mechanism provides good protection against accessing stale actors
  2. Consider rate limiting on the create RPC endpoint if not already implemented upstream
  3. The SQL queries use parameterized statements (good), but ensure the SQL storage is properly sandboxed per DO

Test Coverage

⚠️ Missing Tests: The PR doesn't include test updates. Given the critical nature of these changes, I recommend adding tests for:

  1. Actor resurrection: Create → Destroy → Create with same key should increment generation
  2. Concurrent destroy: Ensure destroying flag prevents double-destroys
  3. Generation mismatch: Loading with wrong generation should fail gracefully
  4. WeakMap GC behavior: Verify state is properly cleaned up (may need integration test)
  5. Error handling in onStop: Verify cleanup continues even if onStop throws

Existing tests: Check if rivetkit-typescript/packages/cloudflare-workers/tests/driver-tests.test.ts covers these scenarios.


Documentation

The code comments are generally good, but consider adding:

  1. JSDoc on ActorGlobalState explaining the lifecycle and when each field is set
  2. Comment on startDestroy explaining why it's fire-and-forget
  3. Architecture doc explaining the generation system and WeakMap strategy

Suggestions for Follow-up

  1. Add metrics/tracing around actor lifecycle events (create, load, destroy, resurrect)
  2. Consider exposing generation number in admin/debug endpoints
  3. Add Durable Object hibernation support if Cloudflare adds it (to release resources during idle periods)

Overall Assessment

Verdict: ✅ Approve with minor changes recommended

This PR makes important improvements to memory management and state consistency. The core architectural changes (WeakMap, generation handling, type safety) are solid. The main concerns are:

  1. Must fix: Error handling in #callOnStopAsync
  2. Should fix: Race condition documentation/handling in startDestroy
  3. Nice to have: Test coverage for the new behavior

The changes are well-structured and show good understanding of Cloudflare Workers' Durable Objects lifecycle. Once the error handling is addressed, this will be a solid improvement to the codebase.


Great work on tackling this complex state management issue! 🚀

@NathanFlurry NathanFlurry changed the base branch from 11-12-feat_rivetkit_add_ability_to_destroy_actors to graphite-base/3459 November 13, 2025 02:27
@NathanFlurry NathanFlurry removed the request for review from MasterPtato November 13, 2025 02:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants