Conversation
fegin commented Nov 12, 2025

Stack from ghstack (oldest at bottom):

Summary

This PR adds scripts/loss_compare.py for comparing training losses between different git commits and/or training configurations.

Key Features

  • Commit Comparison: Compare losses between two different git commits with deterministic training
  • Configuration Comparison: Compare different training configurations on the same commit
  • Reproducibility: Automatically enables deterministic mode and seed checkpointing for reproducible comparisons
  • Real-time Output: Streams training output to both the console and log files during execution
  • Statistical Analysis: Generates step-by-step loss comparisons and summary statistics (a sketch of the log parsing this relies on follows this list)
  • CI Testing: Includes an --assert-equal flag for automated testing to verify that losses are identical
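
The step-by-step comparison hinges on extracting per-step losses from the training logs (the "Extracted 100 steps from baseline log" lines in the output below). A minimal sketch of that extraction, assuming torchtitan log lines of the form "step: 2 loss: 7.8168"; the exact pattern in loss_compare.py may differ:

```
import re

# Assumed log format: torchtitan prints lines like "step:  2  loss:  7.8168".
# This regex is illustrative, not the exact one used by loss_compare.py.
STEP_LOSS_RE = re.compile(r"step:\s*(\d+)\s+loss:\s*([0-9.]+)")

def extract_losses(log_path: str) -> dict[int, float]:
    """Return a mapping of training step -> loss parsed from a log file."""
    losses: dict[int, float] = {}
    with open(log_path) as f:
        for line in f:
            match = STEP_LOSS_RE.search(line)
            if match:
                losses[int(match.group(1))] = float(match.group(2))
    return losses
```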

Usage Examples

Compare two commits

python3 ./scripts/loss_compare.py main my_branch

Compare two commits with custom configuration

python3 ./scripts/loss_compare.py main my_branch \
--baseline-cmd="CONFIG_FILE='./custom.toml' ./run_train.sh --parallelism.tensor_parallel_degree=2"

Compare different parallelization strategies on the same commit

python3 ./scripts/loss_compare.py . . \
--baseline-cmd="CONFIG_FILE='./llama3_8b.toml' ./run_train.sh --parallelism.tensor_parallel_degree=2" \
--test-cmd="CONFIG_FILE='./llama3_8b.toml' ./run_train.sh"

Assert equality for CI testing

python3 ./scripts/loss_compare.py main my_branch --assert-equal
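
The Reproducibility and Real-time Output features boil down to two steps: append deterministic-mode flags to each training command, then tee the run's output to the console and a log file. A rough sketch, assuming the flags are --debug.deterministic and --debug.seed=42 (these appear verbatim in a dry-run example later in this thread); the actual wiring in loss_compare.py may differ:

```
import subprocess

def run_training(cmd: str, log_path: str, deterministic: bool = True) -> None:
    """Run one training command, streaming output to console and a log file."""
    if deterministic:
        # Assumed flags, copied from the dry-run example in this thread.
        cmd += " --debug.deterministic --debug.seed=42"
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd, shell=True, stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True,
        )
        for line in proc.stdout:
            print(line, end="")  # real-time console echo
            log.write(line)
        proc.wait()
```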

Real Use Cases

Compare full-dtensor simple FSDP with FSDP2:

python3 scripts/loss_compare.py . . \
--baseline-cmd='CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --activation_checkpoint.mode="none"'  \
--test-cmd='TRAIN_FILE=torchtitan.experiments.full_dtensor.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
--assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok

Compare full-dtensor simple HSDP with HSDP2 (this comparison currently reports a loss mismatch, which the assertion catches):

python3 scripts/loss_compare.py . . \
--baseline-cmd='CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-cmd='TRAIN_FILE=torchtitan.experiments.full_dtensor.train CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 2 steps from baseline log
[LOSS_COMPARE] Extracted 2 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... FAIL

======================================================================
FAIL: test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/chienchin/mywork/torchtitan/scripts/loss_compare.py", line 557, in test_losses_equal
    self.assertEqual(
AssertionError: 7.8168 != 7.8178 : Loss mismatch at step 2: baseline=7.8168, test=7.8178

----------------------------------------------------------------------
Ran 1 test in 0.000s
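
For reference, the assertion output above comes from a unittest built on the fly. A minimal sketch consistent with the names and failure message shown (the code around line 557 of loss_compare.py may differ in detail):

```
import unittest

def assert_losses_equal(baseline: dict[int, float], test: dict[int, float]) -> None:
    """Build and run a one-off unittest comparing per-step losses."""

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            # Both logs are assumed to contain the same set of steps,
            # as the "Extracted N steps" messages above indicate.
            for step, baseline_loss in baseline.items():
                self.assertEqual(
                    baseline_loss,
                    test[step],
                    f"Loss mismatch at step {step}: "
                    f"baseline={baseline_loss}, test={test[step]}",
                )

    suite = unittest.TestLoader().loadTestsFromTestCase(LossEqualityTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```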

fegin added a commit that referenced this pull request Nov 12, 2025
ghstack-source-id: ec9e87d
Pull-Request: #2029
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Nov 12, 2025
fegin added a commit that referenced this pull request Nov 12, 2025
ghstack-source-id: 9a16e69
Pull-Request: #2029
fegin changed the title from "Add a loss comparison script" to "[WIP] Add a loss comparison script" on Nov 12, 2025
fegin marked this pull request as draft on November 12, 2025, 21:49
fegin added a commit that referenced this pull request Nov 12, 2025
ghstack-source-id: 7cac102
Pull-Request: #2029
fegin added a commit that referenced this pull request Nov 13, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* #2027
* __->__ #2026

As title
fegin added a commit that referenced this pull request Nov 13, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry-run mode works, but it doesn't exit gracefully in all cases. This PR fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh \
  --training.steps=10 --activation_checkpoint.mode="none" \
  --debug.deterministic --debug.seed=42
```
fegin added a commit that referenced this pull request Nov 13, 2025
ghstack-source-id: 42c9629
Pull-Request: #2029
fegin added a commit that referenced this pull request Nov 13, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* __->__ #2030

The current CompileModule results in an "inner" prefix for everything. This PR fixes it by overloading the methods.

Also merges #2028 into this PR, since something went wrong with ghstack.