
Commit b6b2c2d

Fix AC(compile(model)) by disabling Dynamo LRU cache (#1991)
Stacked PRs: #1991

Fixes #1971. A description of the fix is in pytorch/pytorch#166926 (a default-on fix is being tackled by @williamwen42). Briefly, disabling the Dynamo LRU cache ensures that the graph used at activation-checkpoint recompute time is the same as the one used during the original forward. The issue arises when the same Python code object (module/function) has multiple valid graphs, e.g. one with static shapes and one with dynamic shapes.

Requires pytorch/pytorch#167038.

Turning off the LRU cache can increase Dynamo cache lookup overhead, but this should not affect torchtitan, since we ensure relatively few graphs (usually 0 for bf16, or 1 for mxfp8) per torch.compile-wrapped code object.
1 parent: 268020d
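For context, a minimal sketch of the pattern this commit works around: an activation-checkpointed, torch.compile-wrapped block is recomputed during backward, and Dynamo's per-code-object LRU cache is disabled up front so the recompute selects the same graph as the original forward. The ToyBlock module, shapes, and run() helper below are illustrative only (not from torchtitan), and the private _set_lru_cache hook assumes a PyTorch build that includes pytorch/pytorch#167038.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Workaround sketch: disable Dynamo's LRU cache so the graph picked at
# recompute time matches the one used in the original forward.
# (Private API; only present in builds with pytorch/pytorch#167038.)
torch._C._dynamo.eval_frame._set_lru_cache(False)

class ToyBlock(nn.Module):
    """Stand-in for a compiled transformer block (illustrative only)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))

block = torch.compile(ToyBlock())

def run(x: torch.Tensor) -> torch.Tensor:
    # AC(compile(model)): the compiled block is recomputed in backward.
    return checkpoint(block, x, use_reentrant=False)

x = torch.randn(8, 64, requires_grad=True)
run(x).sum().backward()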

File tree

1 file changed, +2 −0 lines


torchtitan/models/llama4/infra/parallelize.py

Lines changed: 2 additions & 0 deletions
@@ -514,6 +514,8 @@ def apply_compile(model: nn.Module, compile_config: CompileConfig):
     # NOTE: This flag is needed for torch.compile to avoid graph breaking on dynamic shapes in token-choice MoE
     # but it is experimental.
     torch._dynamo.config.capture_scalar_outputs = True
+    # Workaround for https://github.com/pytorch/pytorch/issues/166926
+    torch._C._dynamo.eval_frame._set_lru_cache(False)
     for layer_id, transformer_block in model.layers.named_children():
         if transformer_block.moe_enabled:
             # If it is a MoE layer, FSDP(GroupedExperts) will cause a graph break
