-
Notifications
You must be signed in to change notification settings - Fork 720
test: Extend Migration Tests to SGLang #4711
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
ff42ba4 to
677484f
Compare
WalkthroughThis PR introduces a new SGLang fault-tolerance migration test module with a Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/fault_tolerance/migration/test_sglang.py (2)
55-55: Port derivation fromworker_id[-1]is fragile but acceptable for current scope.The pattern
f"808{worker_id[-1]}"works for "worker1", "worker2", etc., but would break for "worker10" (yielding port 8080). Consider extracting an explicit port parameter if the test suite expands to more workers.
103-331: Consider parameterizing similar test cases in the future.The four tests share significant structural similarity (frontend/worker setup, request sending, worker termination). While the current explicit approach is clear and maintainable, future expansion could benefit from
@pytest.mark.parametrizeto reduce duplication. This is optional given the distinct validation logic for each scenario.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/fault_tolerance/migration/test_sglang.py(1 hunks)tests/fault_tolerance/migration/utils.py(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.
📚 Learning: 2025-12-02T18:13:40.037Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.037Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.
Applied to files:
tests/fault_tolerance/migration/test_sglang.py
🧬 Code graph analysis (1)
tests/fault_tolerance/migration/test_sglang.py (2)
tests/utils/managed_process.py (2)
ManagedProcess(71-568)terminate_process_tree(45-67)tests/fault_tolerance/migration/utils.py (4)
determine_request_receiving_worker(93-157)start_completion_request(49-90)validate_completion_response(160-197)verify_migration_occurred(200-218)
🪛 Ruff (0.14.7)
tests/fault_tolerance/migration/test_sglang.py
108-108: Unused function argument: runtime_services
(ARG001)
108-108: Unused function argument: predownload_models
(ARG001)
108-108: Unused function argument: set_ucx_tls_no_mm
(ARG001)
158-158: Unused function argument: runtime_services
(ARG001)
158-158: Unused function argument: predownload_models
(ARG001)
158-158: Unused function argument: set_ucx_tls_no_mm
(ARG001)
209-209: Unused function argument: runtime_services
(ARG001)
209-209: Unused function argument: predownload_models
(ARG001)
209-209: Unused function argument: set_ucx_tls_no_mm
(ARG001)
273-273: Unused function argument: runtime_services
(ARG001)
273-273: Unused function argument: predownload_models
(ARG001)
273-273: Unused function argument: set_ucx_tls_no_mm
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: trtllm (amd64)
- GitHub Check: trtllm (arm64)
- GitHub Check: sglang (amd64)
- GitHub Check: sglang (arm64)
- GitHub Check: operator (arm64)
- GitHub Check: operator (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (6)
tests/fault_tolerance/migration/utils.py (1)
60-60: LGTM!The reduction from 8192 to 8000 provides sensible headroom below the context-length limit (8192) configured in the SGLang worker, preventing potential edge-case issues when the request approaches the exact limit.
tests/fault_tolerance/migration/test_sglang.py (5)
84-100: LGTM!The
get_pid()andis_ready()methods are well-implemented with appropriate error handling and logging.
103-150: Well-structured migration test with proper fixture usage.The static analysis warnings about unused arguments (
runtime_services,predownload_models,set_ucx_tls_no_mm) are false positives—these are pytest fixtures applied for their side effects. The test flow is clear and comprehensive.
153-201: LGTM!Correctly tests the graceful shutdown path using SIGTERM (
immediate_kill=False) rather than SIGKILL, validating that migration works during controlled shutdown scenarios.
268-331: LGTM with the same caveat as the worker failure test.The test follows the same pattern as
test_no_request_migration_sglang_worker_failure. The same verification concern about the expected failure mode applies here (see previous comment).
256-265: The assertion check is valid and correctly tests the expected failure mode. Whenmigration_limit=0disables migration, the system still attempts stream reconnection on worker failure (logging "Stream disconnected... recreating stream..."), but the reconnection fails with the "Cannot recreate stream: " error. This is the expected behavior confirmed by both the SGLang and vLLM test implementations, which use identical assertion patterns. No changes needed.Likely an incorrect or invalid review comment.
…mented yet Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>
Overview:
Extend existing vLLM Migration Tests into SGLang.
Details:
vLLM migration under graceful shutdown happened in < 1 sec, but SGLang took > 50 secs, which will be investigated in a future ticket.
Where should the reviewer start?
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A
Summary by CodeRabbit
Release Notes
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.