Skip to content

Conversation

@kthui
Copy link
Contributor

@kthui kthui commented Dec 3, 2025

Overview:

Extend existing vLLM Migration Tests into SGLang.

Details:

vLLM migration under graceful shutdown happened in < 1 sec, but SGLang took > 50 secs, which will be investigated in a future ticket.

Where should the reviewer start?

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

Release Notes

  • Tests

    • Added comprehensive test coverage for fault-tolerance and worker migration scenarios, including failure handling and graceful shutdown validation.
  • Chores

    • Adjusted maximum token configuration for completion requests.

✏️ Tip: You can customize this high-level summary in your review settings.

@kthui kthui self-assigned this Dec 3, 2025
@github-actions github-actions bot added the test label Dec 3, 2025
@kthui kthui marked this pull request as ready for review December 3, 2025 02:23
@kthui kthui requested review from a team as code owners December 3, 2025 02:23
kthui added 2 commits December 2, 2025 18:25
Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 3, 2025

Walkthrough

This PR introduces a new SGLang fault-tolerance migration test module with a DynamoWorkerProcess class for worker lifecycle management and four end-to-end migration test cases covering worker failure, graceful shutdown, and disabled migration scenarios. Additionally, a token limit parameter is adjusted in test utilities.

Changes

Cohort / File(s) Summary
SGLang Migration Test Module
tests/fault_tolerance/migration/test_sglang.py
New test module introducing DynamoWorkerProcess class (extends ManagedProcess) with process initialization, environment setup, log cleanup, health checks, and get_pid() / is_ready() methods. Adds four end-to-end test functions: test_request_migration_sglang_worker_failure(), test_request_migration_sglang_graceful_shutdown(), test_no_request_migration_sglang_worker_failure(), and test_no_request_migration_sglang_graceful_shutdown() evaluating migration behavior under worker failure, graceful termination, and disabled migration conditions.
Token Limit Adjustment
tests/fault_tolerance/migration/utils.py
Reduced maximum tokens for completion requests from 8192 to 8000.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • DynamoWorkerProcess class: Verify process lifecycle management, initialization logic, environment variable setup, and health check implementation for correctness
  • Test functions: Validate migration detection logic, request routing assertions, worker termination handling, and expected failure/success paths across four distinct scenarios
  • Cross-test consistency: Ensure all four test cases properly configure shared resources (frontend, workers) and clean up state between runs

Poem

🐰✨ With Dynamo's might and workers true,
We test when systems break right through,
Migrations dance when workers fall,
Our test suite catches them all!
Eight tokens trimmed, the changes gleam—
Fault tolerance's a rabbit's dream! 🌟

Pre-merge checks

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'test: Extend Migration Tests to SGLang' is clear and specific, accurately summarizing the main change of adding SGLang migration tests.
Description check ✅ Passed The pull request description follows the required template with all main sections (Overview, Details, Where should the reviewer start, Related Issues) completed.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/fault_tolerance/migration/test_sglang.py (2)

55-55: Port derivation from worker_id[-1] is fragile but acceptable for current scope.

The pattern f"808{worker_id[-1]}" works for "worker1", "worker2", etc., but would break for "worker10" (yielding port 8080). Consider extracting an explicit port parameter if the test suite expands to more workers.


103-331: Consider parameterizing similar test cases in the future.

The four tests share significant structural similarity (frontend/worker setup, request sending, worker termination). While the current explicit approach is clear and maintainable, future expansion could benefit from @pytest.mark.parametrize to reduce duplication. This is optional given the distinct validation logic for each scenario.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 135bce4 and ff42ba4.

📒 Files selected for processing (2)
  • tests/fault_tolerance/migration/test_sglang.py (1 hunks)
  • tests/fault_tolerance/migration/utils.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
📚 Learning: 2025-12-02T18:13:40.037Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.037Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

  • tests/fault_tolerance/migration/test_sglang.py
🧬 Code graph analysis (1)
tests/fault_tolerance/migration/test_sglang.py (2)
tests/utils/managed_process.py (2)
  • ManagedProcess (71-568)
  • terminate_process_tree (45-67)
tests/fault_tolerance/migration/utils.py (4)
  • determine_request_receiving_worker (93-157)
  • start_completion_request (49-90)
  • validate_completion_response (160-197)
  • verify_migration_occurred (200-218)
🪛 Ruff (0.14.7)
tests/fault_tolerance/migration/test_sglang.py

108-108: Unused function argument: runtime_services

(ARG001)


108-108: Unused function argument: predownload_models

(ARG001)


108-108: Unused function argument: set_ucx_tls_no_mm

(ARG001)


158-158: Unused function argument: runtime_services

(ARG001)


158-158: Unused function argument: predownload_models

(ARG001)


158-158: Unused function argument: set_ucx_tls_no_mm

(ARG001)


209-209: Unused function argument: runtime_services

(ARG001)


209-209: Unused function argument: predownload_models

(ARG001)


209-209: Unused function argument: set_ucx_tls_no_mm

(ARG001)


273-273: Unused function argument: runtime_services

(ARG001)


273-273: Unused function argument: predownload_models

(ARG001)


273-273: Unused function argument: set_ucx_tls_no_mm

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
tests/fault_tolerance/migration/utils.py (1)

60-60: LGTM!

The reduction from 8192 to 8000 provides sensible headroom below the context-length limit (8192) configured in the SGLang worker, preventing potential edge-case issues when the request approaches the exact limit.

tests/fault_tolerance/migration/test_sglang.py (5)

84-100: LGTM!

The get_pid() and is_ready() methods are well-implemented with appropriate error handling and logging.


103-150: Well-structured migration test with proper fixture usage.

The static analysis warnings about unused arguments (runtime_services, predownload_models, set_ucx_tls_no_mm) are false positives—these are pytest fixtures applied for their side effects. The test flow is clear and comprehensive.


153-201: LGTM!

Correctly tests the graceful shutdown path using SIGTERM (immediate_kill=False) rather than SIGKILL, validating that migration works during controlled shutdown scenarios.


268-331: LGTM with the same caveat as the worker failure test.

The test follows the same pattern as test_no_request_migration_sglang_worker_failure. The same verification concern about the expected failure mode applies here (see previous comment).


256-265: The assertion check is valid and correctly tests the expected failure mode. When migration_limit=0 disables migration, the system still attempts stream reconnection on worker failure (logging "Stream disconnected... recreating stream..."), but the reconnection fails with the "Cannot recreate stream: " error. This is the expected behavior confirmed by both the SGLang and vLLM test implementations, which use identical assertion patterns. No changes needed.

Likely an incorrect or invalid review comment.

…mented yet

Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants