test: Extend Migration Tests to SGLang #4711

kthui · 2025-12-03T02:23:31Z

Overview:

Extend existing vLLM Migration Tests into SGLang.

Details:

vLLM migration under graceful shutdown happened in < 1 sec, but SGLang took > 50 secs, which will be investigated in a future ticket.

Where should the reviewer start?

N/A

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

Release Notes

Tests
- Added comprehensive test coverage for fault-tolerance and worker migration scenarios, including failure handling and graceful shutdown validation.
Chores
- Adjusted maximum token configuration for completion requests.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>

coderabbitai · 2025-12-03T02:27:20Z

Walkthrough

This PR introduces a new SGLang fault-tolerance migration test module with a DynamoWorkerProcess class for worker lifecycle management and four end-to-end migration test cases covering worker failure, graceful shutdown, and disabled migration scenarios. Additionally, a token limit parameter is adjusted in test utilities.

Changes

Cohort / File(s)	Summary
SGLang Migration Test Module `tests/fault_tolerance/migration/test_sglang.py`	New test module introducing `DynamoWorkerProcess` class (extends `ManagedProcess`) with process initialization, environment setup, log cleanup, health checks, and `get_pid()` / `is_ready()` methods. Adds four end-to-end test functions: `test_request_migration_sglang_worker_failure()`, `test_request_migration_sglang_graceful_shutdown()`, `test_no_request_migration_sglang_worker_failure()`, and `test_no_request_migration_sglang_graceful_shutdown()` evaluating migration behavior under worker failure, graceful termination, and disabled migration conditions.
Token Limit Adjustment `tests/fault_tolerance/migration/utils.py`	Reduced maximum tokens for completion requests from 8192 to 8000.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

DynamoWorkerProcess class: Verify process lifecycle management, initialization logic, environment variable setup, and health check implementation for correctness
Test functions: Validate migration detection logic, request routing assertions, worker termination handling, and expected failure/success paths across four distinct scenarios
Cross-test consistency: Ensure all four test cases properly configure shared resources (frontend, workers) and clean up state between runs

Poem

🐰✨ With Dynamo's might and workers true,
We test when systems break right through,
Migrations dance when workers fall,
Our test suite catches them all!
Eight tokens trimmed, the changes gleam—
Fault tolerance's a rabbit's dream! 🌟

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'test: Extend Migration Tests to SGLang' is clear and specific, accurately summarizing the main change of adding SGLang migration tests.
Description check	✅ Passed	The pull request description follows the required template with all main sections (Overview, Details, Where should the reviewer start, Related Issues) completed.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

tests/fault_tolerance/migration/test_sglang.py (2)

55-55: Port derivation from worker_id[-1] is fragile but acceptable for current scope.

The pattern f"808{worker_id[-1]}" works for "worker1", "worker2", etc., but would break for "worker10" (yielding port 8080). Consider extracting an explicit port parameter if the test suite expands to more workers.

103-331: Consider parameterizing similar test cases in the future.

The four tests share significant structural similarity (frontend/worker setup, request sending, worker termination). While the current explicit approach is clear and maintainable, future expansion could benefit from @pytest.mark.parametrize to reduce duplication. This is optional given the distinct validation logic for each scenario.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 135bce4 and ff42ba4.

📒 Files selected for processing (2)

tests/fault_tolerance/migration/test_sglang.py (1 hunks)
tests/fault_tolerance/migration/utils.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:6-6
Timestamp: 2025-09-18T21:47:44.143Z
Learning: For PR ai-dynamo/dynamo#2989, the ConnectorTransferBatcher architectural issues will be addressed in a follow-up PR by removing the duplicate batching logic and integrating distributed transfers with the existing TransferBatcher + LocalTransferManager pipeline, rather than adding bounded concurrency primitives like Semaphore.

📚 Learning: 2025-12-02T18:13:40.037Z

Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 4698
File: .github/workflows/container-validation-dynamo.yml:68-68
Timestamp: 2025-12-02T18:13:40.037Z
Learning: In the ai-dynamo/dynamo repository, backend-specific tests (vllm, sglang, trtllm) are intentionally excluded from the container-validation-dynamo.yml workflow using "not (vllm or sglang or trtllm)" because they run in a separate container-validation-backends.yml workflow that has dedicated jobs for each backend. This separation keeps framework-agnostic tests separate from backend-specific tests.

Applied to files:

tests/fault_tolerance/migration/test_sglang.py

🧬 Code graph analysis (1)

tests/fault_tolerance/migration/test_sglang.py (2)

tests/utils/managed_process.py (2)

ManagedProcess (71-568)

terminate_process_tree (45-67)

tests/fault_tolerance/migration/utils.py (4)

determine_request_receiving_worker (93-157)

start_completion_request (49-90)

validate_completion_response (160-197)

verify_migration_occurred (200-218)

🪛 Ruff (0.14.7)

tests/fault_tolerance/migration/test_sglang.py

108-108: Unused function argument: runtime_services

(ARG001)

108-108: Unused function argument: predownload_models

(ARG001)

108-108: Unused function argument: set_ucx_tls_no_mm

(ARG001)

158-158: Unused function argument: runtime_services

(ARG001)

158-158: Unused function argument: predownload_models

(ARG001)

158-158: Unused function argument: set_ucx_tls_no_mm

(ARG001)

209-209: Unused function argument: runtime_services

(ARG001)

209-209: Unused function argument: predownload_models

(ARG001)

209-209: Unused function argument: set_ucx_tls_no_mm

(ARG001)

273-273: Unused function argument: runtime_services

(ARG001)

273-273: Unused function argument: predownload_models

(ARG001)

273-273: Unused function argument: set_ucx_tls_no_mm

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: trtllm (amd64)
GitHub Check: trtllm (arm64)
GitHub Check: sglang (amd64)
GitHub Check: sglang (arm64)
GitHub Check: operator (arm64)
GitHub Check: operator (amd64)
GitHub Check: vllm (arm64)
GitHub Check: vllm (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (6)

tests/fault_tolerance/migration/utils.py (1)

60-60: LGTM!

The reduction from 8192 to 8000 provides sensible headroom below the context-length limit (8192) configured in the SGLang worker, preventing potential edge-case issues when the request approaches the exact limit.

tests/fault_tolerance/migration/test_sglang.py (5)

84-100: LGTM!

The get_pid() and is_ready() methods are well-implemented with appropriate error handling and logging.

103-150: Well-structured migration test with proper fixture usage.

The static analysis warnings about unused arguments (runtime_services, predownload_models, set_ucx_tls_no_mm) are false positives—these are pytest fixtures applied for their side effects. The test flow is clear and comprehensive.

153-201: LGTM!

Correctly tests the graceful shutdown path using SIGTERM (immediate_kill=False) rather than SIGKILL, validating that migration works during controlled shutdown scenarios.

268-331: LGTM with the same caveat as the worker failure test.

The test follows the same pattern as test_no_request_migration_sglang_worker_failure. The same verification concern about the expected failure mode applies here (see previous comment).

256-265: The assertion check is valid and correctly tests the expected failure mode. When migration_limit=0 disables migration, the system still attempts stream reconnection on worker failure (logging "Stream disconnected... recreating stream..."), but the reconnection fails with the "Cannot recreate stream: " error. This is the expected behavior confirmed by both the SGLang and vLLM test implementations, which use identical assertion patterns. No changes needed.

Likely an incorrect or invalid review comment.

…mented yet Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>

kthui self-assigned this Dec 3, 2025

pull-request-size bot added the size/L label Dec 3, 2025

github-actions bot added the test label Dec 3, 2025

kthui marked this pull request as ready for review December 3, 2025 02:23

kthui requested review from a team as code owners December 3, 2025 02:23

kthui added 2 commits December 2, 2025 18:25

chore: Update request token count for 8192 model size

8d4884f

Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>

test: Extend migration tests to SGLang

677484f

Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>

kthui force-pushed the jacky-ft-migrate-test-sglang branch from ff42ba4 to 677484f Compare December 3, 2025 02:25

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 02:26 Inactive

coderabbitai bot reviewed Dec 3, 2025

View reviewed changes

test: Skip graceful shutdown migration tests since they are not imple…

0a030e1

…mented yet Signed-off-by: Jacky <18255193+kthui@users.noreply.github.com>

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 02:46 Inactive

copy-pr-bot bot temporarily deployed to GITLAB December 3, 2025 02:47 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: Extend Migration Tests to SGLang #4711

test: Extend Migration Tests to SGLang #4711

kthui commented Dec 3, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Dec 3, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

test: Extend Migration Tests to SGLang #4711

Are you sure you want to change the base?

test: Extend Migration Tests to SGLang #4711

Conversation

kthui commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kthui commented Dec 3, 2025 •

edited

Loading

coderabbitai bot commented Dec 3, 2025 •

edited

Loading