
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 6, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-site | Ready | Preview | Comment | Nov 10, 2025 10:53pm |
| rivetkit-serverless | Ready | Preview | Comment | Nov 10, 2025 10:53pm |

2 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | | | Nov 10, 2025 10:53pm |
| rivet-inspector | Ignored | Preview | | Nov 10, 2025 10:53pm |

Contributor Author

MasterPtato commented Nov 6, 2025


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Nov 6, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3400

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3400

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3400

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3400

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3400

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3400

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3400

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3400

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3400

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3400

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3400

commit: 7239ed4

@claude

claude bot commented Nov 6, 2025

Code Review: WebSocket Hibernation Feature

Summary

This PR implements WebSocket hibernation functionality for the pegboard gateway, allowing WebSocket connections to sleep when actors are unavailable and wake up when they become ready. Overall, the implementation is well-structured with good separation of concerns.

✅ Strengths

  1. Clear Intent: The rename from WebSocketServiceRetry to WebSocketServiceHibernate better communicates the feature's purpose
  2. Good Abstraction: The handle_websocket_hibernation trait method provides a clean interface for different implementations
  3. Proper Error Handling: The code maintains consistent error handling patterns throughout
  4. Documentation: Inline comments explain the hibernation lifecycle well (proxy_service.rs:1908-1912)

🐛 Potential Bugs & Issues

Critical Issues

1. Missing Peek Consumption After Non-Message Events (pegboard-gateway/src/lib.rs:586-611)

When Ping, Pong, or Frame messages are encountered in the hibernate_ws function, they are peeked but never consumed from the stream. This creates an infinite loop - the same message will be peeked repeatedly.

The code currently has:

match msg {
    Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
    Message::Close(_) => return Ok(HibernationResult::Close),
    // Ignore rest
    _ => {}  // BUG: Peeked message not consumed!
}

Fix: Call pinned.try_next().await? before continuing the loop for ignored message types.
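The fix can be illustrated with a std-only analogue of the peek/consume loop. Everything below is a hypothetical stand-in: `Msg` for tungstenite's `Message`, the synchronous `hibernate` for the async `hibernate_ws`, and `std::iter::Peekable` for the peekable message stream:

```rust
#[derive(Debug, PartialEq)]
enum Msg { Ping, Pong, Binary, Close }

#[derive(Debug)]
enum HibernationResult { Message, Close }

fn hibernate(msgs: &mut std::iter::Peekable<impl Iterator<Item = Msg>>) -> HibernationResult {
    loop {
        match msgs.peek() {
            // Data messages end hibernation but stay in the stream so the
            // main handler processes them after reconnecting.
            Some(Msg::Binary) => return HibernationResult::Message,
            Some(Msg::Close) => return HibernationResult::Close,
            // The fix: consume ignored frame types (ping/pong) so the next
            // peek() sees a fresh message instead of looping forever.
            Some(_) => { msgs.next(); }
            None => return HibernationResult::Close,
        }
    }
}
```

Without the `msgs.next()` in the ignored-message arm, a leading `Ping` would be peeked again on every iteration and the loop would never advance.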

2. Unreachable Code Path (pegboard-gateway/src/lib.rs:595-596)

The bail!("should be unreachable") will never execute since try_next().await? already propagates the error. Either remove the bail! or replace it with unreachable!().

Moderate Issues

3. Race Condition in Hibernation (pegboard-gateway/src/lib.rs:561-575)

The tokio::select! will cancel the losing branch. If a message arrives at the exact moment the actor becomes ready, there's potential for edge cases. This may be acceptable behavior, but worth documenting.

4. TypeScript Typo Fixed (rivetkit-typescript/packages/rivetkit/src/actor/config.ts:74)

Good catch fixing canHibernatWebSocket → canHibernateWebSocket. Verify that documentation and examples using this API are also updated if needed.

⚡ Performance Considerations

Lock Contention: The receiver is wrapped in Arc<Mutex<>>, and during hibernation, the mutex will be held for potentially long periods while waiting for messages. This is likely acceptable since WebSocket handling is typically single-threaded per connection.

Busy Loop Potential: Due to bug #1, ignored message types will cause a tight busy loop. Once fixed, the loop will properly block on peek().

🔒 Security Concerns

1. Close Frame Handling During Hibernation

When a client sends a Close frame during hibernation, the connection terminates gracefully. Ensure that:

  • Resources are properly cleaned up when returning HibernationResult::Close
  • The actor lifecycle is notified that the connection was closed
  • No state corruption occurs if the actor becomes ready after the close

2. Timeout Handling

I don't see an explicit timeout for the hibernation period. Consider:

  • What happens if an actor never becomes ready?
  • Should there be a maximum hibernation duration?

🧪 Test Coverage

Missing Test Coverage (concerning):

  • No tests for the new hibernation functionality
  • No tests for hibernate_ws function behavior
  • No tests for the peek/consume logic
  • No tests for TypeScript config changes

Recommendations:

  1. Add unit tests for hibernate_ws with various message types
  2. Add integration tests for:
    • Actor becomes ready during hibernation
    • Client sends message during hibernation
    • Client closes connection during hibernation
    • Multiple hibernation cycles
  3. Test edge cases (actor never becomes ready, rapid hibernate/wake cycles)

📝 Code Quality & Best Practices

Import Style (guard/src/routing/api_public.rs:4)

The code uses use anyhow::{Context, Result}; but CLAUDE.md states: "Import anyhow using use anyhow::* instead of importing individual types"

Consider changing to use anyhow::*; for consistency.

Formatting & Logging: Good adherence to project standards with hard tabs and structured logging.

🎯 Summary of Action Items

Must Fix (Blocking):

  1. Fix infinite loop bug when ignoring Ping/Pong messages
  2. Remove or fix unreachable code in error handling path

Should Fix (High Priority):

  1. Add comprehensive test coverage for hibernation functionality
  2. Document or handle the race condition in tokio::select!
  3. Consider adding hibernation timeout limits

Nice to Have:

  1. Fix import style to match CLAUDE.md conventions
  2. Add explicit resource cleanup verification for close-during-hibernation
  3. Document the hibernation lifecycle in code comments or docs

Overall Assessment

This is a well-architected feature with good separation of concerns. The main concerns are:

  1. The critical bug in message consumption
  2. Lack of test coverage
  3. Potential edge cases around timing and resource cleanup

Once these issues are addressed, this will be a solid addition to the codebase. Great work on improving the semantics and structure of the WebSocket retry/hibernation logic!

@claude

claude bot commented Nov 6, 2025

PR Review: WebSocket Hibernation Feature

Summary

This PR implements WebSocket hibernation functionality, allowing WebSocket connections to pause when the downstream actor is not ready and resume when it becomes available. This is a significant architectural improvement for handling actor lifecycle management.

✅ Strengths

1. Clean Architecture

  • The new handle_websocket_hibernation trait method provides a clear extension point for custom hibernation behavior
  • Separation of concerns between retry logic and hibernation logic is well-designed
  • The rename from WebSocketServiceRetry to WebSocketServiceHibernate better reflects the actual semantics

2. Smart Peekable Implementation

The use of Peekable stream wrapper (websocket_handle.rs:11-12) is clever:

pub type WebSocketReceiver =
	Peekable<futures_util::stream::SplitStream<WebSocketStream<TokioIo<Upgraded>>>>;

This allows peeking at messages during hibernation without consuming them, which is critical for the hibernation behavior.

3. Proper Error Handling

The hibernate_ws function (pegboard-gateway/src/lib.rs:586-611) correctly handles edge cases:

  • Differentiates between message arrival, close frames, and errors
  • Uses peek to avoid consuming messages that will be processed after hibernation ends

4. TypeScript Integration

The changes to actor/instance.ts:2072-2073 properly implement the hibernatable connection check, enabling actors to sleep when only hibernatable WebSockets are active.


🐛 Potential Issues

1. Logic Bug in hibernate_ws Function ⚠️

Location: pegboard-gateway/src/lib.rs:590-611

The current implementation has an infinite loop issue:

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        let Ok(msg) = msg else {
            pinned.try_next().await?;
            bail!("should be unreachable");
        };
        match msg {
            Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
            Message::Close(_) => return Ok(HibernationResult::Close),
            // Ignore rest
            _ => {}  // ⚠️ This continues the loop without consuming the message!
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

Problem: When receiving Message::Ping, Message::Pong, or Message::Frame, the code hits the _ => {} branch and loops infinitely on the same peeked message.

Suggested Fix:

loop {
    if let Some(msg) = pinned.as_mut().peek().await {
        let Ok(msg) = msg else {
            pinned.try_next().await?;
            bail!("should be unreachable");
        };
        match msg {
            Message::Binary(_) | Message::Text(_) => return Ok(HibernationResult::Message),
            Message::Close(_) => return Ok(HibernationResult::Close),
            // Consume and ignore ping/pong/frame messages
            _ => {
                pinned.try_next().await?;
            }
        }
    } else {
        return Ok(HibernationResult::Close);
    }
}

2. Typo in TypeScript 🔤

Location: rivetkit-typescript/packages/rivetkit/src/actor/config.ts:74

The option name was corrected from canHibernatWebSocket to canHibernateWebSocket, but this is a breaking change for any existing code using the old spelling. Consider:

  • Adding a deprecation notice
  • Supporting both spellings temporarily with a console warning
  • Documenting this in release notes

3. Missing Test Implementation 🧪

Location: guard-core/tests/custom_serve.rs:106-108

async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<()> {
    todo!();
}

The test implementation uses todo!(), which means tests will panic if hibernation is actually triggered. This should either:

  • Be implemented with a no-op or test-specific behavior
  • Have tests added that explicitly test hibernation behavior

Recommendation: Add a test that:

  1. Establishes a WebSocket connection
  2. Triggers hibernation (returns WebSocketServiceHibernate error)
  3. Verifies the connection stays open
  4. Sends a message and verifies the connection resumes

🔒 Security Considerations

1. Resource Exhaustion

During hibernation, the WebSocket connection stays open indefinitely while waiting for either:

  • Actor to become ready
  • Client to send a message

Concern: A malicious client could open many WebSocket connections and leave them idle, consuming server resources.

Recommendation: Consider adding a configurable maximum hibernation timeout to prevent indefinite resource holding.
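One way to sketch such a bound, as a std-only stand-in: `hibernate_with_deadline` and the mpsc channel below are hypothetical (the channel stands in for the wake-up sources; the real gateway code would presumably wrap its hibernation `select!` in `tokio::time::timeout`):

```rust
use std::sync::mpsc;
use std::time::Duration;

#[derive(Debug)]
enum HibernationResult { Woken, TimedOut }

// Bound the hibernation wait: return Woken if a wake-up arrives in time,
// otherwise TimedOut so the caller can close the idle connection.
fn hibernate_with_deadline(wake: &mpsc::Receiver<()>, max: Duration) -> HibernationResult {
    match wake.recv_timeout(max) {
        Ok(()) => HibernationResult::Woken,
        // Treat a dropped sender like a timeout: stop waiting either way.
        Err(_) => HibernationResult::TimedOut,
    }
}
```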

2. Actor Ready Event Subscription

let mut ready_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Ready>(("actor_id", self.actor_id))
    .await?;

Question: What happens if the actor was already ready before the subscription? Could this cause a race condition where the hibernation never ends?

Recommendation: Check actor readiness before subscribing to ensure we don't miss the ready event.
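The subscribe-then-check ordering can be sketched with std primitives; `ready` and `events` below are hypothetical stand-ins for the actor's current state and the workflow Ready subscription. Because the subscription exists before the flag is read, a ready transition between the two steps still delivers an event instead of being missed:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;

// Precondition: `events` was subscribed before calling this function.
fn wait_until_ready(ready: &AtomicBool, events: &mpsc::Receiver<()>) {
    if ready.load(Ordering::SeqCst) {
        return; // actor was already ready before (or while) we subscribed
    }
    let _ = events.recv(); // otherwise block until the Ready event arrives
}
```

The reverse order (check flag, then subscribe) has a lost-wakeup window: the actor could become ready after the check but before the subscription exists.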


⚡ Performance Considerations

1. Lock Contention on WebSocket Receiver

The hibernate_ws function holds the ws_rx lock for the entire hibernation period:

let mut guard = ws_rx.lock().await;

This prevents other parts of the code from accessing the receiver. This appears intentional and correct for this use case, but worth noting.

2. Multiple Route Resolutions

After hibernation, the code re-resolves the route (proxy_service.rs:1955-2014). This is correct behavior, but for high-frequency hibernation scenarios, consider caching route resolution results with a short TTL.


📝 Code Quality

1. Logging Consistency

Good use of structured logging throughout:

tracing::debug!("actor has become ready during hibernation");

Follows the CLAUDE.md guidelines properly.

2. Error Context

Good use of .context() in api_public.rs:

.context("failed to call api-public service")?;

Follows error handling patterns correctly.

3. Comment Quality 👍

Excellent inline documentation:

// After this function returns:
// - the route will be resolved again
// - the websocket will connect to the new downstream target
// - the gateway will continue reading messages from the client ws
//   (starting with the message that caused the hibernation to end)

🧪 Test Coverage

Current: Basic smoke tests exist but hibernation-specific tests are missing.

Recommended Additional Tests:

  1. Test hibernation triggered by actor not ready
  2. Test wake-up on actor ready event
  3. Test wake-up on client message during hibernation
  4. Test close frame during hibernation
  5. Test timeout scenarios (if implemented)
  6. Test concurrent hibernation of multiple WebSockets

📊 Overall Assessment

Code Quality: 8/10
Architecture: 9/10
Security: 7/10 (needs timeout protection)
Test Coverage: 5/10 (needs hibernation-specific tests)
Documentation: 8/10

Verdict: Approve with Minor Changes

This is a well-architected feature with good separation of concerns. The primary issue is the infinite loop bug in hibernate_ws that needs to be fixed before merging. Once that's addressed, this is a solid implementation.


📋 Action Items

Must Fix Before Merge:

  • Fix infinite loop in hibernate_ws for ping/pong messages

Should Address:

  • Implement test hibernation behavior (remove todo!())
  • Consider adding maximum hibernation timeout
  • Add hibernation-specific integration tests

Nice to Have:

  • Document breaking change in TypeScript config option name
  • Add race condition check for actor ready state
  • Consider caching route resolution results

Great work on this feature! The hibernation mechanism is a clever solution to the actor lifecycle problem. 🎉

@claude

claude bot commented Nov 6, 2025

WebSocket Hibernation PR Review

This PR implements WebSocket hibernation, allowing actors to sleep while maintaining WebSocket connections by pausing message processing until the actor becomes ready again. Overall, the implementation is well-structured and follows good practices. Here are my findings:

✅ Strengths

  1. Clean Architecture: The hibernation abstraction is well-designed with the handle_websocket_hibernation trait method, allowing different services to implement their own hibernation logic.

  2. Proper Error Handling: Changed from WebSocketServiceRetry to WebSocketServiceHibernate error provides better semantic clarity about what's happening.

  3. Smart Use of peek(): Using Peekable stream allows inspecting messages without consuming them, which is exactly what's needed for hibernation - check if there's a message to wake up, but don't consume it yet.

  4. Good Logging: Structured logging throughout makes debugging easier (e.g., tracing::debug!(?err, "websocket handler error")).

  5. Configuration Flexibility: The TypeScript side allows both boolean and function-based canHibernateWebSocket configuration for fine-grained control.

  6. Documentation Update: Good catch updating CLAUDE.md to discourage glob imports from anyhow.

🐛 Potential Issues

  1. Grammar in Comment (pegboard-gateway/src/lib.rs:601)

    // We don't care about the close frame because were currently hibernating

    Should be: we're not were

  2. Unreachable Code Logic (pegboard-gateway/src/lib.rs:595-596)

    pinned.try_next().await?;
    bail!("should be unreachable");

    The bail! is indeed unreachable since try_next().await? will propagate the error. Consider either:

    • Remove the bail (rely on ? operator)
    • Add a comment explaining why you're being defensive
    • Use unreachable!() macro instead
  3. Missing Error Context in Glob Import Check
    The CLAUDE.md update is good, but there are still 5 files using use anyhow::* in the codebase. While not part of this PR's scope, it might be worth addressing in a follow-up.

  4. Test Coverage
    The test in guard-core/tests/custom_serve.rs:106 uses todo!() for handle_websocket_hibernation. While this is acceptable for tests that don't exercise this path, consider adding a comment explaining when this needs proper implementation.

🔍 Code Style & Conventions

Good adherence to repository conventions:

  • ✅ Uses structured logging correctly (e.g., tracing::debug!(?err, ...) not tracing::debug!("foo {err}"))
  • ✅ Proper error handling with anyhow Result types
  • ✅ Imports follow the pattern (individual imports, not glob)
  • ✅ Lowercase log messages
  • ✅ Good variable naming

Minor observations:

  • Consider adding a workspace-level lint to prevent use anyhow::* in the future
  • The refactoring in proxy_service.rs where handlers → handler improves clarity
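A sketch of what such a workspace-level lint could look like, using Cargo's workspace lints table (stable since Rust 1.74). Note that clippy::wildcard_imports would flag every glob import, not just anyhow::*:

```toml
# Root Cargo.toml
[workspace.lints.clippy]
wildcard_imports = "warn"

# Each member crate's Cargo.toml then opts in:
[lints]
workspace = true
```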

🎯 Performance Considerations

  1. Lock Holding: In hibernate_ws(), the mutex guard is held throughout the entire peek/next loop. This is appropriate since we need exclusive access, but worth noting that this will block any concurrent attempts to read from the WebSocket.

  2. Polling Strategy: The function loops with peek() which could be CPU-intensive if called frequently. However, since this only runs during hibernation (a relatively rare state), this should be fine.

🔒 Security Considerations

No security concerns identified. The hibernation mechanism properly:

  • Handles close frames during hibernation
  • Validates message types before processing
  • Uses proper error propagation

✨ Suggestions for Future Enhancements

  1. Add metrics/telemetrics for hibernation events (hibernate start/end, duration, etc.)
  2. Consider timeout for hibernation to prevent indefinite waiting
  3. Document the message ordering guarantees during hibernation handoff

📊 Test Coverage Assessment

The PR modifies critical WebSocket handling logic but doesn't add new tests. Consider adding:

  • Test for hibernation → wake cycle
  • Test for client disconnection during hibernation
  • Test for messages arriving during hibernation

Summary

This is a solid implementation of an important feature. The code is clean, well-structured, and follows repository conventions. The main issues are minor (grammar, unreachable code comment) and easily addressed.

Recommendation: Approve with minor suggested fixes.

Great work! 🚀

@claude

claude bot commented Nov 6, 2025

PR Review: WebSocket Hibernation Implementation

Thank you for implementing WebSocket hibernation! This is a well-structured feature that enables actors to sleep while maintaining WebSocket connections. Here's my detailed review:

✅ Strengths

  1. Clean Error Handling Refactor: The renaming from WebSocketServiceRetry to WebSocketServiceHibernate better reflects the actual behavior and makes the intent clearer.

  2. Smart Design: The hibernation logic properly handles the lifecycle:

    • Uses Peekable stream to inspect messages without consuming them
    • Subscribes to actor Ready events to wake up when the backend is available
    • Properly distinguishes between messages that should wake the connection vs. ignore (ping/pong)
  3. TypeScript Integration: The typo fix (canHibernatWebSocket → canHibernateWebSocket) and proper implementation in the actor instance shows good attention to detail.

  4. Code Style Compliance: Changes follow the CLAUDE.md guidelines (structured logging, lowercase log messages, etc.)

🔍 Issues & Concerns

Critical: anyhow Glob Imports Still Present

While CLAUDE.md was updated to prohibit glob imports from anyhow, several files still use use anyhow::*;:

// Found in these files:
engine/packages/guard-core/src/websocket_handle.rs:1
engine/packages/guard-core/src/custom_serve.rs:1  
engine/packages/guard-core/src/server.rs:10

Recommendation: Update these to explicit imports like use anyhow::{Context, Result, bail}; to match the updated style guide and the good example in proxy_service.rs:1.

Potential Race Condition in hibernate_ws

In pegboard-gateway/src/lib.rs:592-614, the hibernation loop peeks at messages and consumes non-data messages:

_ => {
    pinned.try_next().await?;  // Line 607
}

Issue: If the stream is Peekable, calling try_next() consumes the peeked message. However, there's a subtle race: if a Binary/Text message arrives between the peek() check and when we return to the main handler, we might miss it or process it twice.

Recommendation: Consider whether the logic should:

  1. Always consume the peeked message after inspecting it, OR
  2. Document clearly that the main handler will see the same message we peeked at

Missing Test Coverage

The test in guard-core/tests/custom_serve.rs:106 implements the new trait method with todo!():

async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<bool> {
    todo!();
}

Issue: This will panic if the test path exercises hibernation.

Recommendation: Either:

  1. Implement a proper test stub (return Ok(true) or Ok(false))
  2. Add integration tests that verify the hibernation behavior end-to-end
  3. At minimum, return a reasonable default like Ok(true) with a comment

TypeScript TODO Comment

In rivetkit-typescript/packages/rivetkit/src/actor/instance.ts:2065, there's a stale TODO:

// TODO: When WS hibernation is ready, update this to only count non-hibernatable websockets

Recommendation: Since hibernation is now implemented (line 2072 checks !conn.isHibernatable), this TODO should be removed or updated to reflect that hibernation IS ready but raw WebSockets still need work.

🎯 Minor Suggestions

  1. Documentation: Consider adding inline comments in handle_websocket_hibernation explaining the lifecycle:

    • What triggers hibernation end (actor ready OR client message)
    • What happens when we return false vs true
    • The comment at proxy_service.rs:1908-1912 is good but could be referenced from the trait definition
  2. Logging Consistency: In pegboard-gateway/src/lib.rs:563, the log uses "has become ready" but line 570 uses "received message during hibernation, reconnecting". Consider making both explicitly mention "hibernation" for easier log filtering:

    tracing::debug!("actor became ready during hibernation, reconnecting");
  3. Error Context: The hibernate_ws function could benefit from more context on errors:

    pinned.try_next().await.context("failed to consume non-data websocket message during hibernation")?;

🔒 Security & Performance

  • No security concerns identified: The hibernation doesn't introduce auth bypasses or data leaks
  • Performance: Using Peekable adds minimal overhead and the hibernation pattern should significantly reduce actor resource usage
  • Resource Management: Properly handles cleanup on close and timeout scenarios

📊 Overall Assessment

This is a solid implementation of a complex feature. The core logic is sound and the integration across Rust and TypeScript layers is well-executed. The main concerns are:

  1. Must fix: Remove glob imports from anyhow per updated style guide
  2. Should fix: Address the todo!() in tests and verify peek/consume behavior
  3. Nice to have: Update stale TODOs and improve documentation

The feature appears production-ready once the anyhow imports are cleaned up and test coverage is verified.

Recommendation: Approve with minor changes requested (fix anyhow imports and test stub).

@claude

claude bot commented Nov 6, 2025

Code Review: WebSocket Hibernation Feature

This PR implements WebSocket hibernation to allow actors to sleep while maintaining client connections. Overall, the implementation is solid with good attention to detail. Here are my findings:

✅ Strengths

  1. Well-structured feature: The hibernation mechanism cleanly separates concerns between detecting hibernation, handling it, and reconnecting.

  2. Good error handling: Proper use of custom error types (WebSocketServiceHibernate) following the project's error conventions in engine/packages/guard-core/src/errors.rs.

  3. Smart use of Peekable: The switch to Peekable<WebSocketReceiver> in engine/packages/guard-core/src/websocket_handle.rs:11-12 allows peeking at messages without consuming them - crucial for hibernation.

  4. Proper lifecycle management: The handle_websocket_hibernation trait method returns bool to indicate if the connection should close, giving implementations full control.

  5. Documentation: Good inline comments explaining the hibernation flow in proxy_service.rs:1908-1912.

🐛 Potential Issues

1. Missing Mutex Lock Semantics (High Priority)

In pegboard-gateway/src/lib.rs:592-614, the hibernate_ws function:

async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
    let mut guard = ws_rx.lock().await;
    let mut pinned = std::pin::Pin::new(&mut *guard);
    
    loop {
        if let Some(msg) = pinned.as_mut().peek().await {
            // ...

Issue: The mutex lock is held for the entire duration of hibernation (potentially indefinitely). If handle_websocket_hibernation is called again with the same WebSocketHandle, it will deadlock.

Fix: Consider releasing the lock between peek operations or documenting that concurrent calls are not supported.

2. Incomplete Test Coverage (High Priority)

The test file engine/packages/guard-core/tests/custom_serve.rs:105-107 has:

async fn handle_websocket_hibernation(&self, _websocket: WebSocketHandle) -> Result<bool> {
    todo!();
}

Issue: The test implementation uses todo!() which will panic if called.

Recommendation: Either implement a proper test or mark the test as #[should_panic] with documentation explaining this is not testable in this context.

3. Race Condition Potential (Medium Priority)

In pegboard-gateway/src/lib.rs:555-584, the handle_websocket_hibernation uses tokio::select!:

let close = tokio::select! {
    _ = ready_sub.next() => { false }
    hibernation_res = hibernate_ws(client_ws.recv()) => { /* ... */ }
};

Issue: If the actor becomes ready while a message is being peeked, there's a small window where the message could be lost or processed out of order.

Recommendation: Consider whether message ordering guarantees are maintained when waking from hibernation.

4. Attempt Counter Reset Logic (Low Priority)

In proxy_service.rs:1901-1905:

if ws_hibernate {
    attempts = 0;
} else {
    attempts += 1;
}

Observation: Hibernation resets the attempt counter. This is probably correct, but ensure this doesn't allow infinite retries if hibernation repeatedly fails.

🔍 Code Quality

  1. Follows project conventions: ✅

    • Uses structured logging: `tracing::debug!(?err, "...")`
    • Proper error handling with anyhow::Result
    • Follows naming conventions (snake_case, descriptive names)
  2. Import management: ✅

    • CLAUDE.md update correctly changes from glob import to explicit imports
    • Example in api_public.rs:3: use anyhow::{Context, Result};
  3. Variable naming: ✅

    • Changed handlers → handler (singular) in proxy_service.rs:1831 for clarity

🔒 Security Considerations

  1. No obvious security issues - hibernation doesn't expose sensitive data
  2. Proper close frame handling - connection cleanup appears correct
  3. No DoS vector - hibernation is opt-in via canHibernateWebSocket

⚡ Performance Considerations

  1. Mutex contention: The long-held lock in hibernate_ws could impact performance under high connection counts
  2. Memory usage: Hibernated connections maintain state - ensure this is bounded
  3. Subscription overhead: Each hibernation creates a new subscription (pegboard-gateway/src/lib.rs:556-559) - consider pooling

📝 Minor Suggestions

  1. Typo fix already included: ✅ canHibernatWebSocket → canHibernateWebSocket (good catch!)

  2. Consider adding metrics: Track hibernation events, wake-ups, and duration for observability

  3. Documentation: Add a doc comment to handle_websocket_hibernation explaining the return value semantics more clearly

🧪 Test Coverage

Missing:

  • Unit tests for hibernate_ws function
  • Integration tests for the full hibernation cycle
  • Tests for race conditions (actor ready vs message arrival)
  • Tests for error cases (hibernation timeout, connection drops during hibernation)

Recommendation: Add at least one integration test demonstrating the hibernation flow works end-to-end.

Summary

This is a well-implemented feature that follows the codebase conventions. The main concerns are:

  1. Fix the mutex lock semantics in hibernate_ws to prevent deadlocks
  2. Complete the test implementation or document why it's incomplete
  3. Add integration tests for the hibernation feature
  4. Consider the race condition between actor ready events and message arrival

The code is production-ready after addressing the high-priority issues above.

export const chatRoom = actor({
	options: {
		sleepTimeout: 2000,
		canHibernateWebSocket: true,
Member


not needed – this is for onWebSocket

Contributor Author

@MasterPtato Nov 7, 2025


I removed this line, but I still don't know what you mean.


false
}
hibernation_res = hibernate_ws(client_ws.recv()) => {
Member


won't this consume the message? or does this get written to the buffer + replayed? either way, needs a comment on what happens to the message.

Contributor Author


none of the logic related to websocket messages is present here, and in the hibernate_ws fn it uses .peek(), which implies it's not consuming the message


if attempts > max_attempts
|| (!is_retryable_ws_error(&err) && !ws_retry)
if ws_hibernate {
Member


how do we handle failure cases if we fail to re-wake the actor when the client disconnects?

Contributor Author


We don't do anything if the client disconnects. When we add waking, if it fails to wake the only thing we can do is log the error.

@claude

claude bot commented Nov 7, 2025

PR Review: WebSocket Hibernation Feature

This PR implements WebSocket hibernation support, allowing actors to sleep when there are hibernatable WebSocket connections. Overall, the implementation is well-structured and follows good patterns. Here's my detailed feedback:


Strengths

  1. Clean Architecture: The separation of concerns between CustomServeTrait, proxy_service, and pegboard-gateway is well done. The new HibernationResult enum provides a clear API.

  2. Good Error Handling: The rename from WebSocketServiceRetry to WebSocketServiceHibernate is more semantically accurate and follows the error handling patterns in the codebase.

  3. Comprehensive Testing: The test in guard-core/tests/custom_serve.rs:261-320 demonstrates the hibernation flow well with proper setup and assertions.

  4. Proper Use of Peekable Stream: The change to make WebSocketReceiver a Peekable stream (websocket_handle.rs:10-11) is smart - it allows checking for messages without consuming them during hibernation.

  5. Follows CLAUDE.md Conventions:

    • Uses individual imports instead of glob imports from anyhow ✅
    • Lowercase log messages ✅
    • Structured logging with field parameters ✅
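The Peekable behavior praised in point 4 can be illustrated with a plain std iterator; this is a minimal sketch of the concept only, not the actual `WebSocketReceiver` API:

```rust
fn main() {
    // A stand-in for a stream of websocket frames.
    let frames = vec!["ping", "text:hello", "text:world"];
    let mut rx = frames.into_iter().peekable();

    // peek() borrows the next item without consuming it...
    assert_eq!(rx.peek(), Some(&"ping"));
    // ...so the same frame is still yielded by next().
    assert_eq!(rx.next(), Some("ping"));

    // During hibernation the gateway can peek to detect a wake-worthy
    // frame while leaving it in the stream for the reconnected handler.
    if let Some(frame) = rx.peek() {
        assert!(frame.starts_with("text:"));
    }
    assert_eq!(rx.next(), Some("text:hello"));
}
```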

🔍 Issues & Concerns

Critical Issues

  1. Race Condition in ws_hibernation_close Flag (proxy_service.rs:1914-1935)

    The logic prevents hibernating twice after receiving a close frame, but there's a potential race:

    ensure!(
        !ws_hibernation_close,
        "should not be hibernating again after receiving a close frame during hibernation"
    );

    Issue: If the actor goes down again immediately after reconnection but before handling the close, this ensure will fail. Consider whether this should be a warning log instead of a hard failure, or add more context about why this is unreachable.

  2. Missing Test Coverage

    The hibernation test (test_custom_serve_websocket_hibernation) doesn't test several important scenarios:

    • What happens when the client sends a close frame during hibernation?
    • What happens if the actor never becomes ready during hibernation?
    • What happens with binary messages during hibernation?
    • Error cases in the hibernation flow

Medium Priority Issues

  1. Potential Deadlock in hibernate_ws (pegboard-gateway/src/lib.rs:590-612)

    async fn hibernate_ws(ws_rx: Arc<Mutex<WebSocketReceiver>>) -> Result<HibernationResult> {
        let mut guard = ws_rx.lock().await;
        // ... holds lock for entire hibernation period
    }

    Issue: This function holds the mutex lock on ws_rx for the entire hibernation period. If another part of the code needs to access the receiver during hibernation, it will block. Consider whether this is intentional or if the lock should be released/reacquired.
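The trade-off here is lock scope. A generic sketch of the alternative, using a std `Mutex` rather than the actual gateway types, shows how a guard can be dropped before a long wait so other tasks can lock in the meantime:

```rust
use std::sync::{Arc, Mutex};

fn main() {
    let rx = Arc::new(Mutex::new(vec![1, 2, 3]));

    // Pattern 1: hold the guard across the whole operation.
    // Anyone else calling rx.lock() blocks until this scope ends.
    {
        let guard = rx.lock().unwrap();
        // ...a long hibernation-style wait here would block other users...
        assert_eq!(guard.len(), 3);
    } // guard dropped here

    // Pattern 2: copy what you need, drop the guard, then wait.
    let first = { rx.lock().unwrap().first().copied() };
    assert_eq!(first, Some(1));
    // Other tasks can now lock `rx` during the wait.
    assert_eq!(rx.lock().unwrap().len(), 3);
}
```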

  2. Infinite Loop Without Timeout (pegboard-gateway/src/lib.rs:594-611)

    The hibernate_ws function has an infinite loop that only exits when a Text/Binary/Close message is received. Ping/Pong frames are consumed but don't exit hibernation.

    Question: Should there be a timeout or maximum number of ping/pong frames to prevent an infinite loop if only control frames are received?

  3. Comment Accuracy (custom_serve.rs:41)

    /// Returns true if the websocket should close.
    

    Issue: The comment says "returns true" but the function returns HibernationResult enum. Should be:

    /// Returns whether the websocket should continue or close after hibernation.
    
  4. Typo in TypeScript (rivetkit-typescript/packages/rivetkit/src/actor/config.ts:74)

    Fixed typo: canHibernatWebSocket → canHibernateWebSocket

    However, this is a breaking change for any users relying on the old property name. Consider:

    • Adding a deprecation warning for the old name
    • Supporting both names temporarily with a migration period

Minor Issues

  1. Inconsistent Variable Naming (proxy_service.rs:1986)

    Ok(ResolveRouteOutput::CustomServe(new_handler)) => {
        handler = new_handler;

    The variable is named new_handler in the match arm but immediately replaces handler. This is fine, but updated_handler might be clearer about the relationship.

  2. Log Message Could Be More Specific (pegboard-gateway/src/lib.rs:394)

    tracing::debug!("tunnel sub closed");

    Consider adding context: tracing::debug!("tunnel subscription closed, initiating hibernation");

  3. Test Timing Dependency (custom_serve.rs:297)

    tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;

    This hardcoded sleep could be flaky in CI environments. Consider using a synchronization primitive or retrying the assertion with a timeout.


🎯 Suggested Improvements

  1. Add Integration Test for Close During Hibernation

    #[tokio::test]
    async fn test_websocket_close_during_hibernation() {
        // Test that sending a close frame during hibernation
        // properly reconnects and notifies the actor
    }
  2. Consider Adding Metrics

    Given the metrics module is imported in proxy_service.rs, consider adding metrics for:

    • Number of websocket hibernations
    • Duration of hibernations
    • Hibernations that ended in close vs continue
  3. Document the Hibernation Flow

    The comments in proxy_service.rs:1919-1923 are excellent. Consider adding similar documentation to the CustomServeTrait definition explaining the full lifecycle.

  4. Add Timeout to hibernate_ws

    const HIBERNATION_MAX_DURATION: Duration = Duration::from_secs(300); // 5 minutes
    
    tokio::select! {
        // `hibernate_loop` stands for the existing peek loop, factored out.
        result = hibernate_loop(pinned) => result,
        _ = tokio::time::sleep(HIBERNATION_MAX_DURATION) => {
            Err(anyhow!("hibernation timeout exceeded"))
        }
    }

📊 Performance Considerations

  1. Lock Contention: The Arc<Mutex<WebSocketReceiver>> in hibernate_ws could be a bottleneck under high concurrency. Profile to ensure it's not a problem.

  2. Memory Usage: Each hibernating websocket holds onto its full state. Monitor memory usage with many concurrent hibernating connections.


🔒 Security Considerations

  1. No Significant Security Issues: The hibernation mechanism doesn't introduce obvious security vulnerabilities.

  2. DoS Consideration: Ensure there are limits on:

    • Maximum number of concurrent hibernating websockets per actor
    • Maximum hibernation duration (currently unlimited)

📝 Documentation Updates

The PR includes extensive documentation updates in website/public/llms-full.txt for Kubernetes deployment. These look good but are unrelated to the websocket hibernation feature - consider moving to a separate PR for clarity.


Test Coverage Assessment

Current Coverage: Basic happy path is tested
Missing Coverage:

  • Close frame during hibernation
  • Actor never becomes ready
  • Multiple hibernation cycles
  • Error handling in hibernation
  • Concurrent hibernations

Recommendation: Add at least 2-3 more test cases covering error scenarios.


🎉 Conclusion

This is a well-implemented feature that follows the codebase conventions. The main concerns are:

  1. Potential race condition with ws_hibernation_close
  2. Missing timeout in hibernate_ws
  3. Need for additional test coverage
  4. Breaking change in TypeScript API

Recommendation: Address the critical issues and consider the suggested improvements before merging. The core implementation is solid.


Generated with Claude Code


@MasterPtato MasterPtato marked this pull request as ready for review November 8, 2025 02:17
@NathanFlurry NathanFlurry marked this pull request as draft November 8, 2025 02:18
@MasterPtato MasterPtato force-pushed the 11-05-feat_gateway_websocket_hibernation branch from 8cef849 to 7239ed4 Compare November 10, 2025 22:49
@MasterPtato MasterPtato marked this pull request as ready for review November 10, 2025 22:50

claude bot commented Nov 10, 2025

Code Review: WebSocket Hibernation Feature

This PR implements WebSocket hibernation, allowing actors to sleep while maintaining WebSocket connections. Overall, this is a well-implemented feature with good test coverage. Here's my detailed feedback:

Strengths ✅

  1. Clear Architecture: The separation between handle_websocket and handle_websocket_hibernation is well-designed
  2. Good Error Handling: Proper use of custom errors (WebSocketServiceHibernate) following the project's error patterns
  3. Test Coverage: The new hibernation test in custom_serve.rs validates the core functionality
  4. Backward Compatibility: Default implementations with bail! ensure services opt-in to hibernation support
  5. Proper Logging: Good use of structured logging with tracing throughout

Issues & Concerns 🔍

1. Potential Race Condition in hibernate_ws (pegboard-gateway/src/lib.rs:590-612)

The hibernate_ws function uses peek() to check messages without consuming them. However, after peeking and detecting a text/binary message, the message is NOT consumed before returning HibernationResult::Continue. This means the same message will be processed again after reconnection.

Current flow:

  • Peek at message → detect text/binary → return Continue
  • Handler reconnects → processes the SAME message again (duplicate processing)

Recommendation: Either:

  • Consume the peeked message before returning, OR
  • Document this behavior clearly if it's intentional
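If consuming is the desired fix, the shape would be peek-to-classify, then next-to-consume. An iterator-based sketch of that idea (the `Frame` enum and `drain_until_wake` helper are hypothetical, not the real tunnel types):

```rust
#[derive(Debug, PartialEq)]
enum Frame {
    Ping,
    Text(String),
    Close,
}

// Returns the first wake-worthy frame, consuming it so the
// reconnected handler does not see it again.
fn drain_until_wake<I: Iterator<Item = Frame>>(
    rx: &mut std::iter::Peekable<I>,
) -> Option<Frame> {
    loop {
        match rx.peek() {
            Some(Frame::Ping) => {
                rx.next(); // control frame: consume and keep sleeping
            }
            Some(_) => return rx.next(), // text/close: consume and wake
            None => return None,
        }
    }
}

fn main() {
    let mut rx = vec![Frame::Ping, Frame::Text("hi".into()), Frame::Close]
        .into_iter()
        .peekable();
    assert_eq!(drain_until_wake(&mut rx), Some(Frame::Text("hi".into())));
    // The wake frame was consumed; the next frame is Close.
    assert_eq!(rx.next(), Some(Frame::Close));
}
```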

2. Infinite Loop Risk in hibernate_ws (pegboard-gateway/src/lib.rs:594)

The function has a loop that only exits on text/binary/close messages. If the stream produces endless ping/pong/frame messages, this could loop indefinitely consuming them with try_next().await.

Recommendation: Add a timeout or limit on iterations, or at minimum document the assumption that ping/pong frames are bounded.
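One way to bound the loop without pulling in timers is an iteration cap on consecutive control frames. A sketch of that approach with a synchronous stand-in stream (the cap value, `Frame` type, and error strings are illustrative assumptions, not the real code):

```rust
#[derive(Debug, PartialEq)]
enum Frame {
    Ping,
    Pong,
    Text(String),
}

const MAX_CONTROL_FRAMES: usize = 4; // assumed bound, purely illustrative

fn hibernate(frames: impl IntoIterator<Item = Frame>) -> Result<Frame, String> {
    let mut seen_control = 0;
    for frame in frames {
        match frame {
            Frame::Ping | Frame::Pong => {
                seen_control += 1;
                if seen_control > MAX_CONTROL_FRAMES {
                    return Err("too many control frames during hibernation".into());
                }
            }
            other => return Ok(other), // data frame: exit hibernation
        }
    }
    Err("stream ended during hibernation".into())
}

fn main() {
    // A data frame after a few pings wakes normally.
    let ok = hibernate(vec![Frame::Ping, Frame::Pong, Frame::Text("wake".into())]);
    assert_eq!(ok, Ok(Frame::Text("wake".into())));

    // An endless run of pings is cut off at the cap instead of looping forever.
    let err = hibernate(std::iter::repeat_with(|| Frame::Ping).take(10));
    assert!(err.is_err());
}
```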

3. Incomplete Raw WebSocket Hibernation (actor/instance.ts:2067)

There's a TODO comment indicating raw websockets don't support hibernation yet:

// TODO: When WS hibernation is ready, update this to only count non-hibernatable websockets
if (this.#activeRawWebSockets.size > 0)
    return CanSleep.ActiveRawWebSockets;

Question: Is this intentional for this PR? Should there be a follow-up task to implement raw websocket hibernation?

4. Missing Error Context (pegboard-gateway/src/lib.rs:570)

hibernation_res = hibernate_ws(client_ws.recv()) => {
    let res = hibernation_res?;  // Error bubbles up without context

Recommendation: Add context using .context("failed during websocket hibernation")?;

5. Documentation Gap

The HibernationResult enum and handle_websocket_hibernation trait method lack doc comments explaining:

  • When each variant should be used
  • What happens after returning each variant
  • The lifecycle of the websocket during hibernation

6. Typo in Config Field Name (Fixed but worth noting)

Good catch fixing canHibernatWebSocket → canHibernateWebSocket. Make sure this is properly communicated to users if it's a breaking change.

Security Considerations 🔒

  1. WebSocket State Persistence: During hibernation, the websocket state (including any buffered messages) persists in memory. Ensure there are memory limits to prevent resource exhaustion if an actor hibernates with large buffered messages.

  2. Hibernation Timeout: The test uses HIBERNATION_TIMEOUT = 2 seconds. Is there a production timeout limit? Long hibernations could be used for resource exhaustion attacks.

Performance Considerations ⚡

  1. Mutex Contention: WebSocketHandle wraps both tx and rx in Arc<Mutex<>>. During high-throughput scenarios, this could cause contention. Consider if the receiver mutex is necessary given the peek pattern.

  2. Route Re-resolution: After hibernation, the route is re-resolved. If many actors hibernate simultaneously (e.g., during a deploy), this could cause a thundering herd on the routing system.
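A common mitigation for this kind of thundering herd is capped exponential backoff with per-connection jitter before re-resolving. A deterministic sketch (the constants and the LCG-style jitter are assumptions for illustration, not anything in this PR):

```rust
// Delay before the Nth route re-resolution attempt, in milliseconds.
// `seed` would be something per-connection (e.g. a hashed actor id)
// so simultaneous hibernation wake-ups spread apart.
fn backoff_ms(attempt: u32, seed: u64) -> u64 {
    const BASE_MS: u64 = 100;
    const CAP_MS: u64 = 5_000;
    let exp = BASE_MS.saturating_mul(1u64 << attempt.min(10)).min(CAP_MS);
    // Cheap LCG-style jitter in [0, exp): deterministic per seed.
    let jitter = seed
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407)
        % exp.max(1);
    // "Equal jitter": half the window fixed, half randomized.
    exp / 2 + jitter / 2
}

fn main() {
    // Delays grow with attempts but never exceed the cap.
    for attempt in 0..16 {
        assert!(backoff_ms(attempt, 42) <= 5_000);
    }
    // Different seeds (connections) get different delays for the same attempt.
    assert_ne!(backoff_ms(3, 1), backoff_ms(3, 2));
}
```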

Minor Issues 🔧

  1. proxy_service.rs:1914-1917: The ensure! check for double-hibernation is good, but the error message could be more actionable. Consider logging the state that led to this condition.

  2. Inconsistent Variable Naming: In proxy_service.rs, the handler variable changes from handlers (plural) to handler (singular). While this makes semantic sense (it's a single handler), ensure all references are updated consistently.

  3. Missing Test Cases:

    • Test hibernation followed by a close frame from client
    • Test multiple hibernation cycles in one connection
    • Test hibernation timeout behavior
    • Test actor failing to become ready during hibernation

Suggestions 💡

  1. Add Integration Tests: The current test is unit-level. Consider adding an integration test that hibernates a real actor and verifies end-to-end behavior.

  2. Metrics: Consider adding metrics for:

    • Number of hibernations
    • Hibernation duration
    • Messages received during hibernation
  3. Documentation: Add a doc comment to CustomServeTrait explaining the hibernation lifecycle and how handle_websocket and handle_websocket_hibernation interact.

Conclusion 📝

This is a solid implementation of a complex feature. The main concerns are around the message consumption logic in hibernate_ws and potential edge cases around concurrent hibernations. Once these are addressed, this should be good to merge.

Recommendation: Request changes for issues #1 and #2 above, rest are suggestions for improvement.


🤖 Generated by Claude Code
