Commit 4cfc0e6
ignore _update_mamba_mask for AWQ sequential tracing (#1925)
SUMMARY:
In models with mamba-2 layers, e.g.
[nvidia/NVIDIA-Nemotron-Nano-12B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2) and
[Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct),
tracing `_update_mamba_mask` leads to
```
File "NemotronHModel_8045287568680_autowrapped", line 57, in forward
File "/mnt/LinuxDrive/huggingface/modules/transformers_modules/NVIDIA_hyphen_Nemotron_hyphen_Nano_hyphen_12B_hyphen_v2/modeling_nemotron_h.py", line 1461, in _update_mamba_mask
if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
^^^^^^^^^^^^^^^^^^^^^
File "/home/toncao/anaconda3/envs/llm-compressor_v1/lib/python3.12/site-packages/transformers/utils/fx.py", line 674, in __bool__
return super().__bool__()
^^^^^^^^^^^^^^^^^^
File "/home/toncao/anaconda3/envs/llm-compressor_v1/lib/python3.12/site-packages/torch/fx/proxy.py", line 577, in __bool__
return self.tracer.to_bool(self)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/toncao/anaconda3/envs/llm-compressor_v1/lib/python3.12/site-packages/torch/fx/proxy.py", line 388, in to_bool
raise TraceError(
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow
```
from the function:
```
def _update_mamba_mask(self, attention_mask, cache_position):
    """
    No need for zeroing states when
        1. Cached forward
        2. Attending to all inputs
    """
    mamba_mask = attention_mask
    if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
        mamba_mask = None
    return mamba_mask
```
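The failure is inherent to `torch.fx` symbolic tracing: during tracing, tensor values are replaced by proxies, and a proxy cannot be converted to a Python `bool`, so any data-dependent `if` raises `TraceError`. A minimal standalone sketch (not from this repo) that reproduces the same error class as `_update_mamba_mask`'s `cache_position[0] > 0` branch:

```python
import torch
import torch.fx


class TinyModel(torch.nn.Module):
    def forward(self, cache_position: torch.Tensor) -> torch.Tensor:
        # Data-dependent control flow: under symbolic tracing,
        # `cache_position[0] > 0` is a Proxy, and calling bool() on it
        # raises torch.fx.proxy.TraceError.
        if cache_position[0] > 0:
            return cache_position * 2
        return cache_position


try:
    torch.fx.symbolic_trace(TinyModel())
except torch.fx.proxy.TraceError as err:
    print(f"TraceError: {err}")
```

Skipping such functions during tracing (as this PR does via the ignore list) sidesteps the branch entirely, which is safe here because the mask logic does not need to be captured in the traced graph.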
Adding `_update_mamba_mask` to the tracing ignore list therefore makes
AWQ sequential tracing work.
TEST PLAN:
Local `make test` results:
```
===================================================== short test summary info =====================================================
FAILED tests/llmcompressor/modeling/test_calib_deepseek_v3.py::test_calib_deepseekv3_module - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 14.1...
FAILED tests/llmcompressor/utils/test_helpers.py::test_disable_cache[MllamaForConditionalGeneration-meta-llama/Llama-3.2-11B-Vision-Instruct] - huggingface_hub.errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-68ee275c-378c35b1649b823602164fc0;24ebe331-9031-4...
FAILED tests/lmeval/test_lmeval.py::TestLMEval::test_lm_eval[None] - TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'NoneType'
====================================== 3 failed, 242 passed, 4 skipped in 129.47s (0:02:09) =======================================
```
Co-authored-by: toncao <cpatonn@gmail.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
1 file changed: +1, −0 lines