Skip to content

Conversation

@MGAMZ
Copy link

@MGAMZ MGAMZ commented Nov 30, 2025

Error when running DDP training

[rank2]: Traceback (most recent call last):
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/run.py", line 194, in <module>
[rank2]:     main()
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/run.py", line 190, in main
[rank2]:     runner.experiment_queue()
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/run.py", line 172, in experiment_queue
[rank2]:     raise e
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/run.py", line 149, in experiment_queue
[rank2]:     experiment(
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/experiment.py", line 34, in __init__
[rank2]:     self._main_process()
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/experiment.py", line 56, in _main_process
[rank2]:     runner = DynamicRunnerGenerator(self.cfg)  # build runner
[rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/mmeng_PlugIn.py", line 124, in DynamicRunnerGenerator
[rank2]:     return mgam_Runner.from_cfg(cfg)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/zyq_local/mgam_repos/onedl-mmengine/mmengine/runner/runner.py", line 462, in from_cfg
[rank2]:     runner = cls(
[rank2]:              ^^^^
[rank2]:   File "/zyq_local/mgam_repos/ITKIT/itkit/mm/mmeng_PlugIn.py", line 62, in __init__
[rank2]:     super().__init__(experiment_name=exp_name_in_runner, **kwargs)
[rank2]:   File "/zyq_local/mgam_repos/onedl-mmengine/mmengine/runner/runner.py", line 431, in __init__
[rank2]:     self.model = self.wrap_model(
[rank2]:                  ^^^^^^^^^^^^^^^^
[rank2]:   File "/zyq_local/mgam_repos/onedl-mmengine/mmengine/runner/runner.py", line 905, in wrap_model
[rank2]:     model_wrapper_type = MODEL_WRAPPERS.get(
[rank2]:                          ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/zyq_local/mgam_repos/onedl-mmengine/mmengine/registry/registry.py", line 441, in get
[rank2]:     raise TypeError(
[rank2]: TypeError: The key argument of `Registry.get` must be a str, got <class 'itkit.mm.mmeng_PlugIn.RemasteredDDP'>

Another fork branch seems to not having this BUG: https://github.com/MGAMZ/mmengine/tree/dev/mgam
UNKNOWN reason, still debuging.

Possibly related to this commit: open-mmlab@91d945f

@read-the-docs-community
Copy link

read-the-docs-community bot commented Nov 30, 2025

Documentation build overview

📚 onedl-mmengine | 🛠️ Build #30533970 | 📁 Comparing 6990d7f against latest (167a8ab)


🔍 Preview build

Show files changed (5 files in total): 📝 5 modified | ➕ 0 added | ➖ 0 deleted
File Status
advanced_tutorials/data_transform.html 📝 modified
advanced_tutorials/test_time_augmentation.html 📝 modified
migration/transform.html 📝 modified
api/generated/mmengine.model.BaseTTAModel.html 📝 modified
_modules/mmengine/runner/runner.html 📝 modified

@lauriebax
Copy link

It looks like the code calling the registry passes an instantiated class instead of the name of the class that later gets instantiated. This breaks the assumption that all objects are instantiated by the builder (and thus that type should be a string and not a class). I would say this is not a bug in onedl-mmengine but in the way the other library uses onedl-mmengine

@MGAMZ MGAMZ force-pushed the dev/pythonlike-modelwrapper branch from c016e2a to 2575aeb Compare December 1, 2025 13:46
@MGAMZ
Copy link
Author

MGAMZ commented Dec 1, 2025

@lauriebax I was debugging this issue and mistakenly created a PR to onedl instead of my own fork. LOL.

Thank you for your helpful and insightful suggestion. I hereby provide a more detailed description of this PR. Looking forward to your further reply.

Introduction

Whether it is the original OpenMM or the current OneDL, mmengine has planned to use Python-style configuration files. As is described in Config Doc.

Registry.build allows either str or type (a python class) as cfg['type'] at the current implementation. Registry utilizes build_from_cfg as default self.build_func, and build_from_cfg can both handle str and type.

How to Reproduce

For most usage scenarios, compatibility improvements have been introduced. Introducing the Python-style type in the common configuration items is not a problem. However, for some less commonly used configuration items, problems may arise. In this PR, the following two items are included:

  1. model_wrapper_cfg: An advanced config item designed to customize paralleled run, and is set by default without requiring the user to configure it in the cfg.
  2. optim_wrapper.constructor: Provides an entry point for finely configuring the optimization strategy of the parameters of the neural network. It is also set by default without requiring the user to configure it in the cfg.

Manually configuring either items will cause errors during runner startup, here is an example:

from mmengine.model.wrappers import MMDistributedDataParallel

... # Other configs

model_wrapper_cfg = dict(type=MMDistributedDataParallel)

... # Other configs

Why?

It is not a problem with Registry.build as it operates normally with most of the pure python style configuration items.

It's because in some scenarios, we need to obtain the actual class of the type specified in the configuration dict before the registry doing instantiation in build, in which scenario we will use the Registry.get method. Registry.get cannnot handle str as type and will raise AssertionError as this PR's description.

As a workaround, I discover that in multiple places throughout this repository, the code first checks the type of the type value in the configuration dictionary and only calls Registry.get if it is a string; otherwise, it directly uses the type value as the result. I paste two examples here:

@staticmethod
def build_dataloader(dataloader: Union[DataLoader, Dict],
                     seed: Optional[int] = None,
                     diff_rank_seed: bool = False) -> DataLoader:

    ...

    if 'worker_init_fn' in dataloader_cfg:
        worker_init_fn_cfg = dataloader_cfg.pop('worker_init_fn')
        worker_init_fn_type = worker_init_fn_cfg.pop('type')
        if isinstance(worker_init_fn_type, str):
            worker_init_fn = FUNCTIONS.get(worker_init_fn_type)
        elif callable(worker_init_fn_type):
            worker_init_fn = worker_init_fn_type
        else:
            raise TypeError(
                'type of worker_init_fn should be string or callable '
                f'object, but got {type(worker_init_fn_type)}')
        assert callable(worker_init_fn)
        init_fn = partial(
            worker_init_fn, **worker_init_fn_cfg)
    
    ...
if isinstance(collate_fn_cfg, dict):
    collate_fn_type = collate_fn_cfg.pop('type')
    if isinstance(collate_fn_type, str):
        collate_fn = FUNCTIONS.get(collate_fn_type)
    else:
        collate_fn = collate_fn_type
    collate_fn = partial(collate_fn, **collate_fn_cfg)

Fix

My workaround follows the similar logic. The original code is implemented without protection when isinstance(model_wrapper_cfg['type'], type):

model_wrapper_type = MODEL_WRAPPERS.get(model_wrapper_cfg.get('type'))

So adding a careful if logic:

model_wrapper_type = model_wrapper_cfg.get('type')
if isinstance(model_wrapper_type, str):
    model_wrapper_type = MODEL_WRAPPERS.get(
        model_wrapper_type)  # type: ignore
else:
    assert isinstance(model_wrapper_type, type), 'type should be a string or a class'

Further Thinking

I think Registry.get can add support for str and type just as Registry.build. But this can be a quite extensive change with a wide impact.

@MGAMZ MGAMZ changed the title Fix model_wrapper and RemasteredDDP issue Fix model_wrapper_cfg with pure python style config Dec 1, 2025
@lauriebax
Copy link

Thank you for the nice explanation of the issue. This makes reviewing very easy. I agree with the fix, although I prefer you use isclass(x) rather than isinstance(x, type) for clarity.

@MGAMZ
Copy link
Author

MGAMZ commented Dec 2, 2025

Yes, isclass is definitely better. I am marking this PR ready for review.

I may create one more PR for the other similar bug related to optim_wrapper.constructor. (Was lazy to do it now~~~)

@MGAMZ MGAMZ marked this pull request as ready for review December 2, 2025 11:36
Copilot AI review requested due to automatic review settings December 2, 2025 11:36
Copilot finished reviewing on behalf of MGAMZ December 2, 2025 11:39
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug that caused DDP training to fail when using pure Python style config for model_wrapper_cfg. The error occurred because Registry.get() expected a string but received a class object when users passed a class directly as the type value.

Key Changes:

  • Added support for passing class objects directly as the type value in model_wrapper_cfg, in addition to string-based references
  • Added isclass import from the inspect module to validate class types

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +910 to +911
assert isclass(
model_wrapper_type), 'type should be a string or a class'
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion should raise a TypeError instead of using an assert statement for better error handling. Assert statements can be disabled with Python optimization flags (-O), making this validation unreliable in production. Consider:

if not isclass(model_wrapper_type):
    raise TypeError('type should be a string or a class')
Suggested change
assert isclass(
model_wrapper_type), 'type should be a string or a class'
if not isclass(model_wrapper_type):
raise TypeError('type should be a string or a class')

Copilot uses AI. Check for mistakes.
Comment on lines +906 to +911
model_wrapper_type = model_wrapper_cfg.get('type')
if isinstance(model_wrapper_type, str):
model_wrapper_type = MODEL_WRAPPERS.get(model_wrapper_type)
else:
assert isclass(
model_wrapper_type), 'type should be a string or a class'
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new functionality that allows passing a class directly as the type value (instead of a string) lacks test coverage. Consider adding a test case that validates this behavior:

# Example test case
cfg.model_wrapper_cfg = dict(type=CustomModelWrapper)  # Pass class directly
runner = Runner.from_cfg(cfg)
self.assertIsInstance(runner.model, CustomModelWrapper)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants