wip: cf workers fix #3459

NathanFlurry · 2025-11-13T01:33:41Z

No description provided.

vercel · 2025-11-13T01:33:46Z

Deployment failed with the following error:

Too many requests - try again in 3 minutes (more than 120, code: "api-deployments-flood-pro").

Learn More: https://vercel.com/docs/limits#rate-limits

NathanFlurry · 2025-11-13T01:33:58Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2025-11-13T01:45:55Z

PR Review: Cloudflare Workers Actor State Management Fix

This PR fixes actor lifecycle management on Cloudflare Workers by refactoring how actor state is tracked and managed. Overall, it is a well-structured change that resolves memory management and state synchronization problems.

Summary of Changes

The PR refactors actor state management by:

Moving from Map to WeakMap for actor state storage to enable proper garbage collection
Separating state concerns into InitializedData, LoadedActor, and a consolidated ActorGlobalState
Fixing async destroy operations by properly awaiting cleanup operations
Improving generation handling to prevent stale actor instances after resurrection
Unifying RPC response types with explicit success/error discriminated unions
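
To illustrate the last point, here is a minimal sketch of a success/error discriminated union. The `kind` discriminator and the `actorId`, `generation`, `code`, and `message` fields are assumptions made for illustration; the PR's actual `ActorInitResponse` definition may differ.

```typescript
// Hypothetical shape, not the PR's real type definition.
type ActorInitResponse =
    | { kind: "success"; actorId: string; generation: number }
    | { kind: "error"; code: string; message: string };

// The compiler narrows `res` in each branch, so a caller cannot read
// `actorId` without first ruling out the error case.
function describeInit(res: ActorInitResponse): string {
    switch (res.kind) {
        case "success":
            return `actor ${res.actorId} (generation ${res.generation})`;
        case "error":
            return `init failed: ${res.code}: ${res.message}`;
    }
}
```

Because the switch is exhaustive, adding a third variant later would surface as a type error at every call site that has not handled it.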

Code Quality & Best Practices

✅ Strengths

Memory Management Improvement: The switch from Map<string, ActorHandler> to WeakMap<DurableObjectState, ActorGlobalState> is excellent. This allows the garbage collector to clean up actor state when DOs are evicted, preventing memory leaks.
Strong Type Safety: The introduction of discriminated unions (ActorInitResponse) for RPC responses is a clean pattern that forces proper error handling at call sites.
Generation Safety: The new invariant check at actor-handler-do.ts:158-163 catches stale cached actors with mismatched generations, which is crucial for correctness after actor resurrection.
Explicit State Reset: The reset() method on ActorGlobalState provides a clear contract for cleanup.
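
The WeakMap pattern and the `reset()` contract praised above can be sketched roughly as follows. `DurableObjectStateLike`, `getOrCreateState`, and the fields on `ActorGlobalState` are stand-ins invented for this example, and it assumes `reset()` clears both fields; the PR's real classes carry more data.

```typescript
// Hypothetical stand-in for Cloudflare's DurableObjectState.
interface DurableObjectStateLike {
    readonly id: string;
}

// Assumed fields; the real ActorGlobalState holds more than this.
class ActorGlobalState {
    initialized?: { name: string; generation: number };
    actor?: { generation: number };

    // Clear everything so the next create() starts from a clean slate.
    reset(): void {
        this.initialized = undefined;
        this.actor = undefined;
    }
}

// Keyed by the context object itself: once nothing else references a
// context, its entry becomes eligible for garbage collection.
const actorStates = new WeakMap<DurableObjectStateLike, ActorGlobalState>();

function getOrCreateState(ctx: DurableObjectStateLike): ActorGlobalState {
    let state = actorStates.get(ctx);
    if (!state) {
        state = new ActorGlobalState();
        actorStates.set(ctx, state);
    }
    return state;
}
```

A string-keyed `Map` would keep every entry alive until it is deleted explicitly, which is the leak the WeakMap avoids.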

⚠️ Concerns & Potential Issues

1. Race Condition in `startDestroy` (Critical)

Location: actor-driver.ts:264-285

The startDestroy method spawns an async cleanup operation without tracking or awaiting it:

startDestroy(actorId: string): void {
    // ... checks ...
    handler.destroying = true;
    
    // Spawn onStop
    this.#callOnStopAsync(actorId, doId, handler.actorInstance);
}

Issues:

If a new request comes in before #callOnStopAsync completes, the actor might be in an inconsistent state
The destroying flag is set immediately, but cleanup happens asynchronously
No way to wait for destruction to complete if needed

Recommendation:

// Consider returning a promise or tracking pending destroys
private pendingDestroys = new Map<string, Promise<void>>();

startDestroy(actorId: string): void {
    // ... existing checks ...
    // Swallow rejections on the tracked promise so that neither the
    // .finally() chain nor a later await in loadActor rejects unhandled;
    // #callOnStopAsync should log its own errors.
    const destroyPromise = this.#callOnStopAsync(actorId, doId, handler.actorInstance)
        .catch(() => {})
        .finally(() => this.pendingDestroys.delete(doId));
    this.pendingDestroys.set(doId, destroyPromise);
}

// Then in loadActor, wait for any in-flight destroy before loading:
const pendingDestroy = this.pendingDestroys.get(doId);
if (pendingDestroy) {
    await pendingDestroy;
}
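
For reference, the tracking idea can be exercised in isolation. The following is a self-contained sketch, not code from the PR: `DestroyTracker`, `track`, `waitFor`, and `isPending` are hypothetical names, and it assumes cleanup errors are logged elsewhere so that waiters never reject.

```typescript
// Hypothetical helper illustrating the pending-destroy pattern.
class DestroyTracker {
    private pending = new Map<string, Promise<void>>();

    // Register an in-flight cleanup. Rejections are swallowed here on the
    // assumption that the cleanup routine reports its own errors.
    track(doId: string, cleanup: Promise<void>): void {
        const settled: Promise<void> = cleanup
            .catch(() => {})
            .then(() => {
                // Only remove the entry if it has not been replaced since.
                if (this.pending.get(doId) === settled) {
                    this.pending.delete(doId);
                }
            });
        this.pending.set(doId, settled);
    }

    isPending(doId: string): boolean {
        return this.pending.has(doId);
    }

    // Resolves immediately when nothing is pending for this id.
    async waitFor(doId: string): Promise<void> {
        await this.pending.get(doId);
    }
}
```

A loader would call `waitFor(doId)` before reading state, so a request that arrives mid-destroy observes either the old actor fully torn down or nothing at all.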

2. Inconsistent State Initialization Pattern

Location: actor-handler-do.ts:107-112 vs actor-handler-do.ts:366-370

The constructor initializes #state from the WeakMap, but create() also has fallback initialization logic:

// Constructor (lines 107-112)
this.#state = globalState.getActorState(this.ctx);
if (!this.#state) {
    this.#state = new ActorGlobalState();
    globalState.setActorState(this.ctx, this.#state);
}

// create() method (lines 366-370)
if (!this.#state) {
    this.#state = new ActorGlobalState();
    globalState.setActorState(this.ctx, this.#state);
}

Issue: This duplication suggests uncertainty about when state might not exist. If the constructor always runs first, the check in create() should be an invariant.

Recommendation:

// In create(), replace with:
invariant(this.#state, "State should be initialized in constructor");

3. Potential Memory Leak with Strong References

Location: actor-handler-do.ts:72-77

The comment mentions that ActorHandler holds a strong reference while the global state holds a weak reference:

/**
 * This holds a strong reference to ActorGlobalState.
 * CloudflareDurableObjectGlobalState holds a weak reference so we can
 * access it elsewhere.
 **/
#state?: ActorGlobalState;

Issue: The ActorGlobalState contains references to actorRouter, actorDriver, and actorInstance which may hold significant resources. If a DO instance stays alive but idle for a long time, these resources won't be released.

Question: Is there a lifecycle hook when the DO becomes idle where you could clear #state.actor but keep #state.initialized?

4. Missing Error Handling in Async Cleanup

Location: actor-driver.ts:287-313

The #callOnStopAsync method has no error handling:

async #callOnStopAsync(
    actorId: string,
    doId: string,
    actor: CoreAnyActorInstance,
) {
    // Stop
    await actor.onStop("destroy");
    // ... cleanup operations ...
}

Issue: If onStop throws, the cleanup operations (SQL deletes, alarm deletion, KV deletion) won't execute, leaving orphaned state.

Recommendation:

async #callOnStopAsync(
    actorId: string,
    doId: string,
    actor: CoreAnyActorInstance,
) {
    try {
        await actor.onStop("destroy");
    } catch (error) {
        logger().error({ 
            msg: "error in actor onStop", 
            actorId, 
            error: stringifyError(error) 
        });
        // Continue with cleanup even if onStop fails
    }

    // ... cleanup operations ...
}
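
The same reasoning applies to each cleanup operation after `onStop`: one failed delete should not block the others. One possible way to structure that, sketched under the assumption that every step can be expressed as an independent async function (`runAllSteps` and the step names are hypothetical, not from the PR):

```typescript
// Hypothetical helper: run every step in order and collect the names of
// the ones that failed, instead of stopping at the first error.
interface CleanupStep {
    name: string;
    run: () => Promise<void>;
}

async function runAllSteps(steps: CleanupStep[]): Promise<string[]> {
    const failed: string[] = [];
    for (const step of steps) {
        try {
            await step.run();
        } catch {
            // A real implementation would log the error here.
            failed.push(step.name);
        }
    }
    return failed;
}
```

The returned list gives the caller a single place to log or report which parts of the teardown need a retry.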

5. Stale Actor Clearing Could Be More Robust

Location: actor-handler-do.ts:328-333

When resurrecting a destroyed actor, the code clears #state.actor but not other fields:

if (this.#state) {
    this.#state.actor = undefined;
}

Recommendation: Use the reset() method for consistency:

if (this.#state) {
    // Keep initialized data, reset actor instance
    const initializedData = this.#state.initialized;
    this.#state.reset();
    this.#state.initialized = initializedData;
}

6. Type Safety: ActorKey Type Inconsistency

Location: actor-driver.ts:169

const key = JSON.parse(result.value[1] as string) as string[];

But ActorKey is imported and used elsewhere. The cast to string[] should be ActorKey for consistency.
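
Since the value comes out of storage as JSON, a cast alone does not verify anything at runtime. A small sketch of a validating parser follows; it assumes `ActorKey` is an array of strings (as the existing `string[]` cast implies), and `parseActorKey` is a hypothetical helper name.

```typescript
// Assumption: ActorKey is string[]; the real alias is imported in the PR.
type ActorKey = string[];

// Parse a stored key and reject anything that is not an array of strings.
function parseActorKey(raw: string): ActorKey {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed) || !parsed.every((part) => typeof part === "string")) {
        throw new Error("stored actor key is not an array of strings");
    }
    return parsed as ActorKey;
}
```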

Performance Considerations

✅ Good: The WeakMap approach is excellent for memory management
⚠️ Concern: The loadActor method at actor-driver.ts:132-203 performs SQL queries on every call when the actor isn't cached. Consider adding debug logging to track cache hit/miss rates.

Security Concerns

No critical security issues identified, but note:

The generation mechanism provides good protection against accessing stale actors
Consider rate limiting on the create RPC endpoint if not already implemented upstream
The SQL queries use parameterized statements (good), but ensure the SQL storage is properly sandboxed per DO

Test Coverage

⚠️ Missing Tests: The PR doesn't include test updates. Given the critical nature of these changes, I recommend adding tests for:

Actor resurrection: Create → Destroy → Create with same key should increment generation
Concurrent destroy: Ensure destroying flag prevents double-destroys
Generation mismatch: Loading with wrong generation should fail gracefully
WeakMap GC behavior: Verify state is properly cleaned up (may need integration test)
Error handling in onStop: Verify cleanup continues even if onStop throws
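
As a starting point for the first three scenarios, the expected behavior can be pinned down against a small in-memory model before wiring it to the real driver. Everything below (`ActorRegistryModel`, `Slot`, and the rule that generations start at 0) is an assumption for illustration, not the PR's implementation.

```typescript
// Hypothetical in-memory model of the lifecycle rules listed above.
type Slot = { generation: number; destroyed: boolean };

class ActorRegistryModel {
    private slots = new Map<string, Slot>();

    // Creating over a destroyed actor resurrects it with generation + 1.
    create(key: string): number {
        const slot = this.slots.get(key);
        if (slot && !slot.destroyed) throw new Error("actor already exists");
        const generation = slot ? slot.generation + 1 : 0;
        this.slots.set(key, { generation, destroyed: false });
        return generation;
    }

    // Returns true only for the call that actually starts the destroy.
    destroy(key: string): boolean {
        const slot = this.slots.get(key);
        if (!slot || slot.destroyed) return false;
        slot.destroyed = true;
        return true;
    }

    // A stale generation or a destroyed actor yields undefined, not a throw.
    load(key: string, generation: number): Slot | undefined {
        const slot = this.slots.get(key);
        if (!slot || slot.destroyed || slot.generation !== generation) return undefined;
        return slot;
    }
}
```

Each scenario then becomes a short sequence of calls with one assertion, which can be ported to whatever test runner the package already uses.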

Existing tests: Check if rivetkit-typescript/packages/cloudflare-workers/tests/driver-tests.test.ts covers these scenarios.

Documentation

The code comments are generally good, but consider adding:

JSDoc on ActorGlobalState explaining the lifecycle and when each field is set
Comment on startDestroy explaining why it's fire-and-forget
Architecture doc explaining the generation system and WeakMap strategy

Suggestions for Follow-up

Add metrics/tracing around actor lifecycle events (create, load, destroy, resurrect)
Consider exposing generation number in admin/debug endpoints
Evaluate whether Cloudflare's WebSocket Hibernation API for Durable Objects fits the actor model (to release resources during idle periods)

Overall Assessment

Verdict: ✅ Approve with minor changes recommended

This PR makes important improvements to memory management and state consistency. The core architectural changes (WeakMap, generation handling, type safety) are solid. The main concerns are:

Must fix: Error handling in #callOnStopAsync
Should fix: Race condition documentation/handling in startDestroy
Nice to have: Test coverage for the new behavior

The changes are well-structured and show good understanding of Cloudflare Workers' Durable Objects lifecycle. Once the error handling is addressed, this will be a solid improvement to the codebase.

Great work on tackling this complex state management issue! 🚀

wip: cf workers fix

1f60e30

This was referenced Nov 13, 2025

fix(rivetkit): skip sending RivetKit messages to conns that do not support it #3440

Open

fix(rivetkit): fix race condition with websocket open events #3439

Open

NicholasKissel mentioned this pull request Nov 13, 2025

chore: create branch for v3 website changes on v2 #3304

Open

NathanFlurry mentioned this pull request Nov 13, 2025

chore(website): flatten actors docs structure #3280

Open

This was referenced Nov 13, 2025

feat(rivet-engine): udb key parser #3457

Open

feat(rivetkit): add ability to destroy actors #3458

Open

NathanFlurry changed the base branch from 11-12-feat_rivetkit_add_ability_to_destroy_actors to graphite-base/3459 November 13, 2025 02:27

NathanFlurry requested a review from MasterPtato November 13, 2025 02:45

NathanFlurry closed this Nov 13, 2025

NathanFlurry removed the request for review from MasterPtato November 13, 2025 02:46

This was referenced Nov 13, 2025

chore(rivetkit-typescript): remove dependency on node modules #3460

Open

chore(rivetkit): switch dynamic node imports to use require #3461

Open

chore(rivetkit): fix type checks #3462

Open
