Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 9, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 13, 2025 2:48am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 13, 2025 2:48am
rivet-inspector Ignored Ignored Preview Nov 13, 2025 2:48am
rivet-site Ignored Ignored Preview Nov 13, 2025 2:48am

Copy link
Member Author

NathanFlurry commented Nov 9, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 9, 2025

Pull Request Review: Move Connections to Separate Persisted KV Keys

Summary

This PR refactors the actor persistence layer to store connection data in separate KV entries instead of bundling everything into a single actor state blob. This is a significant architectural improvement with good design choices.

Positive Highlights

Architecture & Design

  • Separation of concerns: Moving connection data to individual KV entries is a smart design decision that reduces write amplification when only connection state changes
  • Consistent key schema: The use of typed prefixes ([1] for actor data, [2] for connections) provides a clean namespace separation
  • Batch operations: New KV batch API (kvBatchPut, kvBatchGet, kvBatchDelete, kvListPrefix) is well-designed and supports efficient bulk operations
  • Change tracking: The #changedConnections set and per-connection hasChanges flag minimize unnecessary writes

Code Quality

  • Schema versioning: Proper use of versioned schemas (v2 for file-system-driver, v3 for actor-persist) with BARE encoding
  • Atomic operations: File system driver uses proper atomic writes with temp files and rename operations (lines 409-429 in global-state.ts)
  • Error handling: Good error handling with try-catch blocks and cleanup of temp files
  • Logging: Comprehensive structured logging throughout the changes

Issues & Concerns

🔴 Critical: Race Condition in Connection Removal

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:1006-1023

#removeConn(conn: Conn<S, CP, CS, V, I, DB>) {
    // Remove from KV directly
    this.#removeConnFromKV(conn.id).catch((error) => {
        this.#rLog.error({
            msg: "failed to remove connection from KV",
            connId: conn.id,
            error: stringifyError(error),
        });
    });

    // Connection is now removed from KV storage only

    // Remove from state and tracking
    this.#connections.delete(conn.id);
    this.#changedConnections.delete(conn.id);

Problem: The KV deletion is fire-and-forget with .catch(). If the deletion fails, the connection is removed from memory but may still exist in storage, causing:

  1. Stale connection data persisted indefinitely
  2. Memory leaks when actor restarts and loads stale connections
  3. Potential security issue if connections contain sensitive data

Recommendation: Make this async and await the deletion:

async #removeConn(conn: Conn<S, CP, CS, V, I, DB>) {
    // Remove from KV storage first
    try {
        await this.#removeConnFromKV(conn.id);
    } catch (error) {
        this.#rLog.error({
            msg: "failed to remove connection from KV",
            connId: conn.id,
            error: stringifyError(error),
        });
        throw error; // Or decide on proper error handling
    }

    // Only remove from memory after successful KV deletion
    this.#connections.delete(conn.id);
    this.#changedConnections.delete(conn.id);
    // ... rest of cleanup
}

🟡 Potential Issue: Missing Error Recovery

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:647-684

The #writePersistedData method writes actor state and connections in a batch, but if the batch write fails partway through:

  • Some connections may be saved while others aren't
  • No rollback mechanism exists
  • Could lead to inconsistent state

Recommendation: Consider adding:

  1. Transaction-like semantics or retry logic
  2. Better error messaging to indicate which connections failed to save
  3. Consider whether partial writes should fail the entire operation

🟡 Schema Naming Inconsistency

Location: rivetkit-typescript/packages/rivetkit/schemas/actor-persist/v3.bare

The schema types changed from PersistedActor/PersistedHibernatableConn to Actor/HibernatableConn, but the TypeScript interfaces still use the Persisted* naming convention. This creates confusion between the serialized schema types and the in-memory types.

Recommendation: Either:

  1. Keep schema names consistent with TypeScript interfaces (PersistedActor)
  2. Or update TypeScript to match schema (ActorPersistedActor distinction)

🟡 Type Safety: Unused TextDecoder

Location: rivetkit-typescript/packages/cloudflare-workers/src/actor-driver.ts:146-155

async kvBatchPut(
    actorId: string,
    entries: [Uint8Array, Uint8Array][],
): Promise<void> {
    const storage = this.#getDOCtx(actorId).storage;
    const encoder = new TextDecoder(); // ⚠️ Never used

The encoder variable is declared but never used. Should be removed.

🟡 Performance: Missing Index for Prefix Searches

Location: rivetkit-typescript/packages/rivetkit/src/drivers/file-system/global-state.ts:844-874

The kvListPrefix implementation does a linear scan through all KV entries:

for (const kvEntry of entry.state.kvStorage) {
    const keyBytes = new Uint8Array(kvEntry.key);
    // Check if key starts with prefix
    if (keyBytes.length >= prefix.length) {
        let hasPrefix = true;
        for (let i = 0; i < prefix.length; i++) {
            if (keyBytes[i] !== prefix[i]) {
                hasPrefix = false;
                break;
            }
        }

Impact: O(n*m) complexity where n = number of keys, m = prefix length. For actors with many connections, this could be slow.

Recommendation:

  • Document expected connection counts
  • Consider adding a sorted index if connection counts are expected to be high (>100s)
  • Or accept the linear scan if connection counts are typically low (<50)

Minor Issues

Documentation

  • Missing JSDoc comments for new KV batch methods in ActorDriver interface
  • The makeConnKey function could use a comment explaining the key format
  • Schema migration path from v2→v3 should be documented

Code Style

  • Consider extracting the key comparison logic in kvListPrefix to a helper function for readability
  • The #uint8ArrayToKey and #keyToUint8Array methods in Cloudflare driver could be static utilities

Testing

  • No new test files added for the KV batch operations
  • Consider adding tests for:
    • Connection persistence across hibernation
    • Concurrent connection modifications
    • KV prefix listing with various edge cases
    • Connection removal cleanup

Performance Considerations

Positive Impact:

  • ✅ Reduced write amplification: Only changed connections are written instead of entire actor state
  • ✅ Batch operations reduce round trips to storage
  • ✅ Change tracking minimizes unnecessary writes

Potential Concerns:

  • Increased KV operations count (one per connection instead of one per actor)
  • File system driver now writes multiple KV entries per actor (increases file size slightly)
  • Consider monitoring KV operation counts in production

Security Assessment

  • ✅ No obvious security vulnerabilities introduced
  • ✅ Proper input validation on keys and values
  • ⚠️ Connection data cleanup on removal needs to be guaranteed (see critical issue above)
  • ✅ No sensitive data appears to be logged

Recommendations

Must Fix (Before Merge)

  1. Fix the race condition in #removeConn to ensure connections are properly cleaned up from KV storage

Should Fix

  1. Add error recovery for partial batch write failures
  2. Remove unused encoder variable in Cloudflare driver
  3. Resolve schema naming inconsistency

Nice to Have

  1. Add comprehensive tests for new KV operations
  2. Document the migration path and backward compatibility strategy
  3. Add JSDoc comments for the new public API methods
  4. Consider adding metrics/logging for KV operation counts to monitor performance impact

Overall Assessment

This is a well-thought-out refactoring that improves the architecture by separating connection state from actor state. The code quality is generally high with good error handling and logging. The main concern is the fire-and-forget deletion in #removeConn which could lead to data leaks. Once that's addressed, this should be good to merge.

Recommendation: Request changes for the critical race condition fix, then approve after addressing.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 9, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3420

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3420

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3420

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3420

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3420

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3420

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3420

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3420

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3420

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3420

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3420

commit: c57be74

@NathanFlurry NathanFlurry force-pushed the 11-08-chore_rivetkit_move_conns_to_separate_persisted_kv_keys branch from ae73e44 to eb0e6f2 Compare November 9, 2025 05:06
@claude
Copy link

claude bot commented Nov 9, 2025

Code Review - PR #3420: Move conns to separate persisted kv keys

Summary

This PR refactors RivetKit's connection storage from embedding all connection data in a single persisted actor state to using separate KV entries per connection. This is a significant architectural improvement that enhances scalability and performance.

Strengths

  1. Improved Architecture: Separating connections into individual KV entries is a smart design decision that:

    • Reduces write amplification (don't need to rewrite all connections when one changes)
    • Enables parallel connection updates
    • Simplifies connection lifecycle management
  2. Comprehensive Driver Updates: All actor drivers (Cloudflare Workers, File System, Engine) have been properly updated with the new KV batch API.

  3. Change Tracking: The addition of change tracking for connections (#changed, hasChanges, markSaved()) is well-designed and prevents unnecessary writes.

  4. Schema Versioning: Proper use of versioned schemas (actor-persist/v3.bare, file-system-driver/v2.bare) for backward compatibility.

Issues & Concerns

Critical Issues

  1. Cloudflare Key Encoding Bug (actor-driver.ts:215-230)

    The #uint8ArrayToKey method has a problematic assumption:

    #uint8ArrayToKey(key: Uint8Array): string {
        // Check if this is a connection key (starts with [2])
        if (key.length > 0 && key[0] === 2) {
            const connId = new TextDecoder().decode(key.slice(1));
            return `${KEYS.CONN_PREFIX}${connId}`;
        }
        // Otherwise, treat as persist data key [1]
        return KEYS.PERSIST_DATA;
    }

    Problem: This assumes all non-connection keys are persist data keys. If any future key types are added (e.g., [3], [4]), they'll silently map to PERSIST_DATA, causing data corruption.

    Fix: Add explicit validation:

    if (key.length > 0 && key[0] === 2) {
        const connId = new TextDecoder().decode(key.slice(1));
        return `${KEYS.CONN_PREFIX}${connId}`;
    } else if (key.length === 1 && key[0] === 1) {
        return KEYS.PERSIST_DATA;
    } else {
        throw new Error(`Unknown key type: ${key[0]}`);
    }
  2. Missing TextDecoder Import (actor-driver.ts:148)

    The code uses new TextDecoder() but I don't see an explicit import. While this works in browser/Cloudflare Workers environments, it's better to be explicit or use the existing encoder pattern.

  3. Unused Variable (actor-driver.ts:148)

    const encoder = new TextDecoder();

    This variable is declared but never used in kvBatchPut. Should be removed.

Performance Concerns

  1. Connection Restoration on Startup (instance.ts:402-432)

    The current implementation loads all connections sequentially:

    const connEntries = await this.#actorDriver.kvListPrefix(actorId, KEYS.CONN_PREFIX);
    for (const [_key, value] of connEntries) {
        const conn = cbor.decode(value) as PersistedConn<CP, CS>;
        connections.push(conn);
    }

    For actors with many connections (100+), this could be slow. Consider:

    • Parallel CBOR decoding with Promise.all
    • Lazy loading connections on first access
    • Connection pagination if extremely large numbers are expected
  2. Write Amplification in File System Driver

    Need to verify that the file system driver's kvBatchPut implementation uses atomic operations or proper locking to prevent partial writes if the process crashes mid-batch.

Code Quality Issues

  1. Inconsistent Error Handling (actor-driver.ts:427-435)

    In connection restoration, errors are logged but execution continues:

    try {
        const conn = cbor.decode(value) as PersistedConn<CP, CS>;
        connections.push(conn);
    } catch (error) {
        this.#rLog.error({
            msg: "failed to decode connection",
            error: stringifyError(error),
        });
    }

    Question: Should corrupted connections be silently skipped, or should this fail fast? Consider adding metrics/alerts for connection decode failures.

  2. Magic Numbers (kv.ts:2-3)

    The key prefixes use magic numbers:

    PERSIST_DATA: Uint8Array.from([1]),
    CONN_PREFIX: Uint8Array.from([2]),

    Consider using an enum for clarity:

    enum KeyPrefix {
        PERSIST_DATA = 1,
        CONN = 2,
    }
  3. Missing Type Guard (conn.ts:150-157)

    The #setupPersistProxy method checks if persist can be proxied but doesn't have a proper type guard:

    if (persist === null || typeof persist !== "object") {
        this.__persist = persist;
        return;
    }

    This could cause issues with primitives. Consider more robust checks.

Security Considerations

  1. Connection ID Validation

    Connection IDs from clients are used as KV keys. Ensure connId is properly validated to prevent:

    • Path traversal attacks (if file system driver stores as files)
    • Key collision attacks
    • DOS via extremely long keys

    Recommend adding validation in makeConnKey() to ensure connIds match expected format (UUID).

  2. CBOR Decode Safety

    Multiple places decode CBOR from storage without size limits:

    const conn = cbor.decode(value) as PersistedConn<CP, CS>;

    If storage is compromised, malicious CBOR could cause DOS. Consider adding size limits.

Testing Recommendations

  1. Add tests for:

    • Connection state persistence across actor restarts
    • Concurrent connection modifications
    • Large numbers of connections (100+)
    • Connection decode failures and error recovery
    • Key encoding/decoding edge cases in Cloudflare driver
  2. Load testing: Test the new KV-based approach with realistic connection counts to ensure performance improvements are realized.

  3. Migration testing: Verify that actors with existing embedded connections can migrate to the new format (or document that this is a breaking change requiring data migration).

Minor Issues

  1. Logging Consistency: Some logs use connId, others use id. Standardize on one.

  2. Documentation: The new KV methods in ActorDriver interface lack JSDoc comments explaining the key/value format and guarantees.

  3. Type Safety: Several as type assertions could be replaced with proper runtime validation.

Questions

  1. Is there a migration path for existing actors with embedded connection data?
  2. What's the expected maximum number of connections per actor?
  3. Should connection cleanup (for disconnected hibernatable connections) be part of this PR?
  4. Are there any atomicity guarantees needed across multiple KV operations?

Verdict

Overall this is a well-architected improvement to RivetKit's persistence layer. The main concerns are:

  • The Cloudflare key encoding bug (critical - must fix)
  • Performance optimization opportunities for connection loading
  • Missing validation and error handling in key areas

Once the critical issues are addressed, this will be a solid enhancement.


🤖 Review generated with Claude Code

@NathanFlurry NathanFlurry marked this pull request as ready for review November 9, 2025 05:44
@claude
Copy link

claude bot commented Nov 13, 2025

Pull Request Review: Move Connections to Separate Persisted KV Keys

This PR introduces a significant architectural change by moving connection data from a single monolithic persisted data structure to separate KV keys. I've reviewed the changes and identified several areas for improvement along with some potential issues.

Overview

The PR changes the persistence model from storing all actor state (including connections) in a single PERSIST_DATA key to storing connections separately with individual keys prefixed by CONN_PREFIX (byte [2]). This is a solid architectural decision for scalability and performance.


Critical Issues

1. Cloudflare Driver Key Encoding Inconsistency

Location: rivetkit-typescript/packages/cloudflare-workers/src/actor-driver.ts:216-238

The #uint8ArrayToKey and #keyToUint8Array methods have potential bugs:

#uint8ArrayToKey(key: Uint8Array): string {
    if (key.length > 0 && key[0] === 2) {
        const connId = new TextDecoder().decode(key.slice(1));
        return `${KEYS.CONN_PREFIX}${connId}`;
    }
    return KEYS.PERSIST_DATA;
}

Issues:

  • The method assumes that if key[0] !== 2, it must be persist data ([1]), but doesn't validate this assumption
  • Connection IDs are decoded as UTF-8 strings, which could fail if the connection ID contains invalid UTF-8 sequences
  • No error handling for decode failures

Recommendation:

#uint8ArrayToKey(key: Uint8Array): string {
    if (key.length === 0) {
        throw new Error("Empty key provided");
    }
    
    if (key[0] === 2) {
        // Connection key - extract connId
        try {
            const connId = new TextDecoder("utf-8", { fatal: true }).decode(key.slice(1));
            return `${KEYS.CONN_PREFIX}${connId}`;
        } catch (error) {
            throw new Error(`Failed to decode connection ID: ${error}`);
        }
    } else if (key[0] === 1) {
        // Persist data key
        return KEYS.PERSIST_DATA;
    } else {
        throw new Error(`Unknown key prefix: ${key[0]}`);
    }
}

2. Schema Version Not Added to Build Script

Location: rivetkit-typescript/packages/rivetkit/package.json:156

The build script was updated to include file-system-driver/v2.bare, but I don't see corresponding updates to ensure backward compatibility with v1. When deploying this change:

  • Do existing actors with v1 schemas need migration?
  • Is there a migration path documented?
  • Will old actors automatically upgrade to v3?

Recommendation: Add migration documentation or upgrade logic if needed.

3. Potential Race Condition in Connection State Saving

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance.ts (various locations)

The connection change tracking uses #changedConnections set, but there's no clear synchronization between:

  1. Marking a connection as changed
  2. Saving connections in batches
  3. Clearing the changed flag

If saveState is called concurrently (even though there's a queue), there could be edge cases where:

  • A connection is modified during serialization
  • The markSaved() call happens before the actual KV write completes

Recommendation: Review the synchronization model and ensure the SinglePromiseQueue properly serializes all state mutations.


Medium Priority Issues

4. Missing Error Handling in Connection Decoding

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:416-424

for (const [_key, value] of connEntries) {
    try {
        const conn = cbor.decode(value) as PersistedConn<CP, CS>;
        connections.push(conn);
    } catch (error) {
        this.#rLog.error({
            msg: "failed to decode connection",
            error: stringifyError(error),
        });
    }
}

Issue: When a connection fails to decode, it's silently skipped with only a log. This could lead to:

  • Lost connection state without user awareness
  • Orphaned KV entries
  • Inconsistent actor state

Recommendation:

  • Implement a recovery strategy (e.g., attempt to delete the corrupted key)
  • Optionally emit metrics/alerts for monitoring
  • Consider a "corrupted connections" counter to track data integrity issues

5. Unbounded KV List Operation

Location: rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:407-410

const connEntries = await this.#actorDriver.kvListPrefix(
    this.#actorId,
    KEYS.CONN_PREFIX,
);

Issue: The kvListPrefix operation has no pagination or limits. For actors with many connections, this could:

  • Exceed platform limits (Cloudflare DO has limits on list operations)
  • Cause memory issues loading all connections at once
  • Slow down actor initialization

Recommendation:

  • Add pagination support to kvListPrefix
  • Consider lazy-loading connections on-demand
  • Document maximum recommended connections per actor

6. Type Safety in Schema Conversion

Location: Throughout the schema conversion code

The bare schema type definitions have been renamed (e.g., PersistedHibernatableConnHibernatableConn in v3.bare), but the TypeScript types still use the old naming conventions. While this works, it creates confusion.

Recommendation: Consider aligning TypeScript type names with the schema names for consistency.


Low Priority / Nice to Have

7. Logging Inconsistencies

  • Some logs use this.#rLog (runtime log) and others use this.actorContext.log (user log)
  • The distinction isn't always clear when reviewing the code
  • Consider adding comments explaining when to use which logger

8. Code Duplication in Key Encoding

The key encoding logic appears in both:

  • rivetkit-typescript/packages/rivetkit/src/actor/kv.ts (makeConnKey)
  • rivetkit-typescript/packages/cloudflare-workers/src/actor-driver.ts (conversion methods)

These should ideally share the same encoding/decoding logic to prevent divergence.

9. Missing Test Coverage

Given the significant architectural change, I'd expect to see:

  • Tests for the new KV operations
  • Tests for connection persistence/restoration
  • Tests for the Cloudflare key encoding/decoding
  • Migration tests from the old format to new format

Recommendation: Add comprehensive test coverage before merging.

10. Performance Consideration

Moving to separate keys is good for isolation, but may impact performance:

  • Previously: 1 KV read to get all state
  • Now: 1 KV read for actor data + N KV reads for connections (though batched)

For actors with many connections, consider:

  • Benchmarking the new approach vs old
  • Documenting recommended connection limits
  • Potentially implementing connection pooling or caching strategies

Positive Notes

Good architectural decision - Separating connection state from actor state improves isolation and allows for better scalability

Proper use of batched KV operations - Using kvBatchPut, kvBatchGet, and kvBatchDelete minimizes round trips

Change tracking - The onChange proxy pattern for tracking connection modifications is elegant

Backward compatibility attempt - The schema versioning shows awareness of upgrade paths


Recommendations Summary

Before merging:

  1. Fix the critical key encoding issues in Cloudflare driver
  2. Add error recovery for corrupted connection data
  3. Add pagination/limits to list operations
  4. Add comprehensive test coverage
  5. Document migration path and any breaking changes
  6. Consider performance benchmarks with many connections

Future improvements:
7. Align TypeScript types with schema names
8. Centralize key encoding logic
9. Add monitoring/metrics for connection state errors
10. Document logger usage patterns

Overall, this is a solid refactoring that improves the architecture, but needs some hardening before production use.

@NathanFlurry NathanFlurry force-pushed the 11-08-chore_rivetkit_remove_deprecated_conn_tokens_sse_support branch from 06d93f5 to e109ee5 Compare November 13, 2025 02:47
@NathanFlurry NathanFlurry force-pushed the 11-08-chore_rivetkit_move_conns_to_separate_persisted_kv_keys branch from d818375 to c57be74 Compare November 13, 2025 02:47
@claude
Copy link

claude bot commented Nov 13, 2025

Code Review: Move conns to separate persisted KV keys

Overview

This PR refactors connection persistence in RivetKit by moving connection data from a single monolithic persisted state object to individual KV entries. This is a significant architectural improvement that addresses scalability and performance concerns.


Positive Aspects

1. Architecture Improvement

  • Granular persistence: Connections are now stored as separate KV entries instead of being serialized within the main actor state. This allows for:
    • More efficient updates (only changed connections are written)
    • Better scalability with large numbers of connections
    • Reduced serialization overhead on state saves

2. Clean Abstraction

The new KV batch operations (kvBatchPut, kvBatchGet, kvBatchDelete, kvListPrefix) provide a consistent interface across drivers:

  • Engine driver (src/drivers/engine/actor-driver.ts:370-418)
  • Cloudflare Workers driver (src/actor-driver.ts:149-244)
  • File system driver (src/drivers/file-system/global-state.ts)

3. Change Tracking

The proxy-based change tracking in conn.ts:139-196 is well-designed:

  • Uses onChange library to detect modifications
  • Validates CBOR serializability on state changes
  • Marks connections as dirty only when necessary
  • Prevents unnecessary writes

Issues & Concerns

1. Schema Versioning & Migration ⚠️

Critical: The schema changes from v2 to v3 rename types without providing migration logic:

In schemas/actor-persist/v3.bare:

-type PersistedHibernatableConn struct {
+type HibernatableConn struct {

Problem: Existing actors with v2 persisted data will fail to deserialize when they wake up after this deployment.

Recommendation:

  • Add migration logic in src/schemas/actor-persist/versioned.ts to handle the rename
  • OR keep the old type names for backward compatibility
  • Document the migration strategy in the commit message

Location: schemas/actor-persist/v3.bare:7-35


2. Error Handling & Data Consistency ⚠️

In src/actor/instance.ts:396-434, when restoring connections:

for (const [_key, value] of connEntries) {
    try {
        const conn = cbor.decode(value) as PersistedConn<CP, CS>;
        connections.push(conn);
    } catch (error) {
        this.#rLog.error({
            msg: "failed to decode connection",
            error: stringifyError(error),
        });
    }
}

Issues:

  1. Failed connection decodes are silently skipped - this could lead to data loss
  2. No metrics/alerting for corrupted connection data
  3. Partial restoration could leave connections in inconsistent state with hibernatableConns array

Recommendation:

  • Add a counter for failed decodes and expose as metric
  • Consider failing actor startup if critical connections fail to decode
  • Log connection IDs to help with debugging

Location: src/actor/instance.ts:419-434


3. Cloudflare Driver Key Encoding 🔴

In cloudflare-workers/src/actor-driver.ts:216-251, the key encoding logic has assumptions:

#uint8ArrayToKey(key: Uint8Array): string {
    // Check if this is a connection key (starts with [2])
    if (key.length > 0 && key[0] === 2) {
        // Connection key - extract connId
        const connId = new TextDecoder().decode(key.slice(1));
        return `${KEYS.CONN_PREFIX}${connId}`;
    }
    // Otherwise, treat as persist data key [1]
    return KEYS.PERSIST_DATA;
}

Problems:

  1. Hardcoded magic numbers: Uses 2 for connection prefix, 1 for persist data. These should reference the constants from kv.ts
  2. Assumption fallback: The else case assumes anything that's not a connection key is persist data - this is fragile and could hide bugs
  3. No validation: If an invalid key format is encountered, it silently returns the wrong key

Recommendation:

#uint8ArrayToKey(key: Uint8Array): string {
    if (key.length === 0) {
        throw new Error("Empty key");
    }
    
    if (key[0] === KEYS.CONN_PREFIX[0]) {
        const connId = new TextDecoder().decode(key.slice(1));
        return `${KEYS.CONN_PREFIX}${connId}`;
    }
    
    if (key[0] === KEYS.PERSIST_DATA[0]) {
        return KEYS.PERSIST_DATA;
    }
    
    throw new Error(`Unknown key prefix: ${key[0]}`);
}

Location: cloudflare-workers/src/actor-driver.ts:216-251


4. Race Condition in Connection Save ⚠️

In src/actor/instance.ts:1041-1075, the connection save logic:

// Save changed connections
for (const connId of this.#changedConnections) {
    const conn = this.#connections.get(connId);
    if (!conn) {
        this.#rLog.warn({
            msg: "connection marked as changed but not found",
            connId,
        });
        continue;
    }
    
    // ... save logic
}

Issue: If a connection is removed between when it's marked as changed and when the save happens, we get a warning but the KV entry might not be cleaned up.

Recommendation: Track deletion separately or ensure cleanup happens in the disconnect flow.

Location: src/actor/instance.ts:1041-1075


5. Missing File System Schema v2 ⚠️

The PR adds schemas/file-system-driver/v2.bare but I don't see:

  • Migration logic from v1 to v2
  • Updates to the versioned module to handle both versions

Recommendation: Ensure the file system driver properly handles schema versioning.

Location: schemas/file-system-driver/v2.bare:1-23


6. Code Duplication in Key Conversion

The Cloudflare driver has custom logic for converting between Uint8Array and string-based keys, while the engine driver works with Uint8Array natively. This creates:

  • Maintenance burden (logic must be kept in sync)
  • Potential for bugs if encoding/decoding doesn't match

Recommendation: Consider extracting key conversion to shared utilities.


Minor Issues

7. Unused Import

In src/drivers/engine/kv.ts:14, there appear to be removed imports that might leave unused dependencies.

8. Magic Number Documentation

The KV key prefixes ([1] for persist data, [2] for connections) should be documented more clearly. Consider adding comments explaining the key structure.

In src/actor/kv.ts:1-14:

export const KEYS = {
    PERSIST_DATA: Uint8Array.from([1]), // Main actor state
    CONN_PREFIX: Uint8Array.from([2]),  // Connection data prefix
};

9. Test Coverage

I don't see test updates in this PR. The new KV operations and connection persistence logic should have:

  • Unit tests for key encoding/decoding
  • Integration tests for connection restoration
  • Tests for edge cases (corrupted data, missing connections, etc.)

Performance Considerations

Positive

  • ✅ Batching operations (kvBatchPut, kvBatchGet) reduces round trips
  • ✅ Only saving changed connections reduces write amplification
  • ✅ Debounced saves prevent excessive writes

Concerns

  • ⚠️ kvListPrefix could be expensive with many connections - consider pagination or limits
  • ⚠️ No obvious cleanup mechanism for orphaned connection KV entries

Security

No significant security concerns identified. The changes maintain the existing trust boundary and don't introduce new attack vectors.


Recommendations Priority

  1. HIGH: Add schema migration logic for v2 → v3 transition
  2. HIGH: Fix hardcoded magic numbers in Cloudflare driver key encoding
  3. MEDIUM: Improve error handling for corrupted connection data
  4. MEDIUM: Add test coverage for new KV operations
  5. LOW: Document key structure and cleanup orphaned entries

Summary

This is a well-architected refactoring that improves scalability and performance. The main concerns are around:

  1. Schema migration/backward compatibility
  2. Error handling for edge cases
  3. Magic number usage in the Cloudflare driver

With the suggested fixes, this will be a solid improvement to the codebase.

Status: Approve with requested changes


🤖 Generated with Claude Code

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 13, 2025

Merge activity

  • Nov 13, 8:47 PM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 13, 8:48 PM UTC: CI is running for this pull request on a draft pull request (#3463) due to your merge queue CI optimization settings.
  • Nov 13, 8:50 PM UTC: Merged by the Graphite merge queue via draft PR: #3463.

@graphite-app graphite-app bot closed this Nov 13, 2025
@graphite-app graphite-app bot deleted the 11-08-chore_rivetkit_move_conns_to_separate_persisted_kv_keys branch November 13, 2025 20:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants