
Commit b6b2c2d

Fix AC(compile(model)) by disabling Dynamo LRU cache (#1991)
Stacked PRs: #1991

Fixes #1971. A description of the fix is in pytorch/pytorch#166926 (a default-on fix is being tackled by @williamwen42). Briefly, disabling the Dynamo LRU cache ensures that the graph used at activation-checkpoint recompute time is the same as the one used during the original forward. The issue arises when the same Python code object (module/function) has multiple valid graphs, e.g. one with static shapes and one with dynamic shapes.

Requires pytorch/pytorch#167038.

Turning off the LRU cache can increase Dynamo cache lookup overhead, but this should not affect torchtitan, since we ensure relatively few graphs (usually 0 for bf16, or 1 for mxfp8) per torch.compile-wrapped code object.
1 parent: 268020d
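For context, a minimal sketch of the pattern this commit works around: an activation-checkpointed, torch.compile-wrapped block is recomputed during backward, and Dynamo's per-code-object LRU cache is disabled up front so the recompute selects the same graph as the original forward. The ToyBlock module, shapes, and run() helper below are illustrative only (not from torchtitan), and the private _set_lru_cache hook assumes a PyTorch build that includes pytorch/pytorch#167038.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Workaround sketch: disable Dynamo's LRU cache so the graph picked at
# recompute time matches the one used in the original forward.
# (Private API; only present in builds with pytorch/pytorch#167038.)
torch._C._dynamo.eval_frame._set_lru_cache(False)

class ToyBlock(nn.Module):
    """Stand-in for a compiled transformer block (illustrative only)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))

block = torch.compile(ToyBlock())

def run(x: torch.Tensor) -> torch.Tensor:
    # AC(compile(model)): the compiled block is recomputed in backward.
    return checkpoint(block, x, use_reentrant=False)

x = torch.randn(8, 64, requires_grad=True)
run(x).sum().backward()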

File tree

1 file changed, +2 −0 lines


torchtitan/models/llama4/infra/parallelize.py

Lines changed: 2 additions & 0 deletions
@@ -514,6 +514,8 @@ def apply_compile(model: nn.Module, compile_config: CompileConfig):
     # NOTE: This flag is needed for torch.compile to avoid graph breaking on dynamic shapes in token-choice MoE
     # but it is experimental.
     torch._dynamo.config.capture_scalar_outputs = True
+    # Workaround for https://github.com/pytorch/pytorch/issues/166926
+    torch._C._dynamo.eval_frame._set_lru_cache(False)
     for layer_id, transformer_block in model.layers.named_children():
         if transformer_block.moe_enabled:
             # If it is a MoE layer, FSDP(GroupedExperts) will cause a graph break
