Conversation

@yingxudeng (Collaborator)

No description provided.

@yingxudeng (Collaborator, Author)

We are currently encountering compilation issues when introducing ProcessGroupHCCL. We might need to update the image again.

    : ProcessGroup(device) {
  c10::intrusive_ptr<c10d_npu::ProcessGroupHCCL::Options> hccl_pg_options =
      c10d_npu::ProcessGroupHCCL::Options::create();
  // hccl_pg_options->group_name = group_name;
Collaborator:

#if TORCH_VERSION_MAJOR >= 2 && TORCH_VERSION_MINOR >= 7
    pg_options->group_name = group_name;
#endif 

Collaborator Author:

Thanks for the suggestion! I'll add the version check.

To ensure forward compatibility (e.g., for PyTorch 3.0, where MINOR would be 0), I'll adjust the logic to also cover the case where MAJOR > 2.

#include "npu_process_group.h"
#include "xllm_kernels/core/include/atb_speed/base/external_comm_manager.h"
#include "xllm_kernels/core/include/atb_speed/utils/singleton.h"
#include "xllm_kernels/models/base/param/mapping.h"
Collaborator:

Lines 22-24 seem unused?

Collaborator Author:

Thanks for the review. These lines were not added by me, but I verified that lines 22-23 are actually necessary for atb_speed::GetSingleton. However, line 24 is indeed unused, so I will remove it.

limitations under the License.
==============================================================================*/

#include "npu_process_group.h"
Collaborator:

npu_process_group.cpp should be deleted, because npu_process_group.h is enough, like cuda/mlu_process_group.h.

Collaborator Author:

Thanks for the review. However, I would prefer to keep the .cpp file: defining implementations directly in the header is generally considered bad practice, so I would like to maintain this separation.

  // for (int i = 0; i < outputs.size(); ++i) {
  //   outputs[i].copy_(flattened_output[i], /*non_blocking=*/true);
  // }
std::unique_ptr<xllm::ProcessGroup> create_process_group(
Collaborator:

The create_process_group function can be placed into an anonymous namespace in collective_communicator.cpp for all devices.

Collaborator Author (@yingxudeng, Dec 1, 2025):

Thanks for the feedback.

Regarding the suggestion to consolidate the create_process_group functions, I have a concern: since ProcessGroupHCCL, ProcessGroupCncl, and ProcessGroupNccl are device-specific implementations, moving them into collective_communicator.cpp would introduce excessive #if/#elif preprocessor directives.

@yingxudeng force-pushed the feat/npu_backend_torch branch from 9b68705 to 3b6f3d1 on December 1, 2025, 10:34.
@yingxudeng (Collaborator, Author)

I have updated the three images on the main branch, and the build is now passing. Would you mind taking a look at the code to see if it is ready for merging? I would appreciate your review. @yq33victor
