Commit ce1c0fc

Fix dry run mode (#2027)
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry run mode works, but it does not exit gracefully in all cases. This PR fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --training.steps=10 --activation_checkpoint.mode="none" --debug.deterministic --debug.seed=42
```
1 parent 4b2b31c commit ce1c0fc

2 files changed: +5 -1 lines changed

scripts/dry_run.py

Lines changed: 3 additions & 0 deletions

@@ -151,6 +151,9 @@ def __init__(self, job_config: JobConfig):
         logger.info("Configuration is ready for training execution.")
         logger.info("=" * 80)
 
+    def train(self):
+        return
+
 
 if __name__ == "__main__":
     main(DryRunTrainer)
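
The added `train` method is a no-op override, presumably so that the shared `main` entry point can call `train()` on a `DryRunTrainer` without starting real training. The sketch below illustrates that pattern in isolation. The `Trainer` base class, its `steps_run` counter, the `config` dictionary, and the `run` helper are hypothetical stand-ins written for this example, not torchtitan's actual classes.

```python
class Trainer:
    """Hypothetical stand-in for the real Trainer; not torchtitan's class."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.steps_run = 0

    def train(self) -> None:
        # Pretend to run the configured number of training steps.
        for _ in range(self.config["steps"]):
            self.steps_run += 1


class DryRunTrainer(Trainer):
    """Performs setup in __init__ and skips the training loop."""

    def train(self) -> None:
        # No-op: the dry run only checks that setup succeeds.
        return


def run(trainer_class: type[Trainer], config: dict) -> Trainer:
    # The entry point calls train() on whichever class it is given.
    trainer = trainer_class(config)
    trainer.train()
    return trainer
```

With this shape, `run(DryRunTrainer, {"steps": 3})` goes through the same code path as a real run but leaves `steps_run` at zero.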

torchtitan/train.py

Lines changed: 2 additions & 1 deletion

@@ -735,7 +735,8 @@ def main(trainer_class: type[Trainer]) -> None:
         raise
     else:
         trainer.close()
-        torch.distributed.destroy_process_group()
+        if torch.distributed.is_initialized():
+            torch.distributed.destroy_process_group()
         logger.info("Process group destroyed")
 
 
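
The change to `main` guards teardown so that the process group is destroyed only when one was actually initialized, which a dry run may never do. A minimal sketch of that guard follows. `FakeDistributed` and `shutdown` are hypothetical names invented for this example so that it runs without PyTorch, and the error raised by the stand-in is an assumption of the sketch, not a statement about how `torch.distributed.destroy_process_group()` fails.

```python
class FakeDistributed:
    """Hypothetical stand-in for torch.distributed, for illustration only."""

    def __init__(self, initialized: bool) -> None:
        self._initialized = initialized
        self.destroy_calls = 0

    def is_initialized(self) -> bool:
        return self._initialized

    def destroy_process_group(self) -> None:
        if not self._initialized:
            # Assumed failure mode, chosen to show why the guard matters.
            raise RuntimeError("no process group to destroy")
        self._initialized = False
        self.destroy_calls += 1


def shutdown(dist: FakeDistributed) -> bool:
    """Destroy the process group only if one exists; report whether it did."""
    if dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False
```

Calling `shutdown` on an uninitialized stand-in returns `False` without raising, which mirrors the graceful exit the commit message describes.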
