Conversation
fegin commented Nov 12, 2025

Stack from ghstack (oldest at bottom):

Summary

This PR adds scripts/loss_compare.py for comparing training losses between different git commits and/or training configurations.

Key Features

  • Commit Comparison: Compare losses between two different git commits with deterministic training
  • Configuration Comparison: Compare different training configurations on the same commit
  • Reproducibility: Automatically enables deterministic mode and seed checkpointing for reproducible comparisons
  • Real-time Output: Streams training output to both the console and log files during execution
  • Statistical Analysis: Generates step-by-step loss comparisons and summary statistics (a sketch of the log parsing this relies on follows this list)
  • CI Testing: Includes an --assert-equal flag for automated testing to verify that losses are identical
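
The step-by-step comparison hinges on extracting per-step losses from the training logs (the "Extracted 100 steps from baseline log" lines in the output below). A minimal sketch of that extraction, assuming torchtitan log lines of the form "step: 2 loss: 7.8168"; the exact pattern in loss_compare.py may differ:

```
import re

# Assumed log format: torchtitan prints lines like "step:  2  loss:  7.8168".
# This regex is illustrative, not the exact one used by loss_compare.py.
STEP_LOSS_RE = re.compile(r"step:\s*(\d+)\s+loss:\s*([0-9.]+)")

def extract_losses(log_path: str) -> dict[int, float]:
    """Return a mapping of training step -> loss parsed from a log file."""
    losses: dict[int, float] = {}
    with open(log_path) as f:
        for line in f:
            match = STEP_LOSS_RE.search(line)
            if match:
                losses[int(match.group(1))] = float(match.group(2))
    return losses
```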

Usage Examples

Compare two commits

python3 ./scripts/loss_compare.py main my_branch

Compare two commits with custom configuration

python3 ./scripts/loss_compare.py main my_branch \
--baseline-cmd="CONFIG_FILE='./custom.toml' ./run_train.sh --parallelism.tensor_parallel_degree=2"

Compare different parallelization strategies on the same commit

python3 ./scripts/loss_compare.py . . \
--baseline-cmd="CONFIG_FILE='./llama3_8b.toml' ./run_train.sh --parallelism.tensor_parallel_degree=2" \
--test-cmd="CONFIG_FILE='./llama3_8b.toml' ./run_train.sh"

Assert equality for CI testing

python3 ./scripts/loss_compare.py main my_branch --assert-equal
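
The Reproducibility and Real-time Output features boil down to two steps: append deterministic-mode flags to each training command, then tee the run's output to the console and a log file. A rough sketch, assuming the flags are --debug.deterministic and --debug.seed=42 (these appear verbatim in a dry-run example later in this thread); the actual wiring in loss_compare.py may differ:

```
import subprocess

def run_training(cmd: str, log_path: str, deterministic: bool = True) -> None:
    """Run one training command, streaming output to console and a log file."""
    if deterministic:
        # Assumed flags, copied from the dry-run example in this thread.
        cmd += " --debug.deterministic --debug.seed=42"
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd, shell=True, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True,
        )
        for line in proc.stdout:
            print(line, end="")  # real-time console echo
            log.write(line)
        proc.wait()
```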

Real Use Cases

Compare full-dtensor simple FSDP with FSDP2:

python3 scripts/loss_compare.py . . \
--baseline-cmd='CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --activation_checkpoint.mode="none"'  \
--test-cmd='TRAIN_FILE=torchtitan.experiments.full_dtensor.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
--assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok

Compare full-dtensor simple HSDP with HSDP2 (this comparison currently reports a loss mismatch, which the assertion catches):

python3 scripts/loss_compare.py . . \
--baseline-cmd='CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-cmd='TRAIN_FILE=torchtitan.experiments.full_dtensor.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 2 steps from baseline log
[LOSS_COMPARE] Extracted 2 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... FAIL

======================================================================
FAIL: test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/chienchin/mywork/torchtitan/scripts/loss_compare.py", line 557, in test_losses_equal
    self.assertEqual(
AssertionError: 7.8168 != 7.8178 : Loss mismatch at step 2: baseline=7.8168, test=7.8178

----------------------------------------------------------------------
Ran 1 test in 0.000s
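
For reference, the assertion output above comes from a unittest built on the fly. A minimal sketch consistent with the names and failure message shown (the code around line 557 of loss_compare.py may differ in detail):

```
import unittest

def assert_losses_equal(baseline: dict[int, float], test: dict[int, float]) -> None:
    """Build and run a one-off unittest comparing per-step losses."""

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            # Both logs are assumed to contain the same set of steps,
            # as the "Extracted N steps" messages above indicate.
            for step, baseline_loss in baseline.items():
                self.assertEqual(
                    baseline_loss,
                    test[step],
                    f"Loss mismatch at step {step}: "
                    f"baseline={baseline_loss}, test={test[step]}",
                )

    suite = unittest.TestLoader().loadTestsFromTestCase(LossEqualityTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```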

fegin added a commit that referenced this pull request Nov 12, 2025
ghstack-source-id: ec9e87d
Pull-Request: #2029
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Nov 12, 2025
fegin added a commit that referenced this pull request Nov 12, 2025
ghstack-source-id: 9a16e69
Pull-Request: #2029
fegin changed the title from "Add a loss comparison script" to "[WIP] Add a loss comparison script" on Nov 12, 2025
fegin marked this pull request as draft on November 12, 2025, 21:49
fegin added a commit that referenced this pull request Nov 12, 2025
ghstack-source-id: 7cac102
Pull-Request: #2029
fegin added a commit that referenced this pull request Nov 13, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* #2027
* __->__ #2026

As title
fegin added a commit that referenced this pull request Nov 13, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry-run mode works, but it doesn't exit gracefully in all cases. This PR fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh \
  --training.steps=10 --activation_checkpoint.mode="none" \
  --debug.deterministic --debug.seed=42
```
fegin added a commit that referenced this pull request Nov 13, 2025
ghstack-source-id: 42c9629
Pull-Request: #2029
fegin added a commit that referenced this pull request Nov 13, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* __->__ #2030

The current CompileModule results in an "inner" prefix for everything. This PR fixes it by overloading the methods.

Also merges #2028 into this PR, since something went wrong with ghstack.