Reduce memory requirements for checkpoint conversion #2636
Conversation
Force-pushed from c3e55a8 to e99a0bd
🤖 Hi @khatwanimohit, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This pull request introduces a lazy_load feature for checkpoint conversion, which significantly reduces peak memory usage. The implementation is well-structured, introducing LazyHFLoader and LazyTensor classes to handle on-demand loading of model weights, and integrating them with Orbax through a custom LazyTensorHandler.
🔍 General Feedback
- The code is clean, well-documented, and includes helpful additions like RAM usage logging and a memory-monitoring progress bar.
- The separate handling for `safetensors` and PyTorch binary files is robust.
- The use of `functools.partial` to create loading functions is elegant (see the sketch below).

I've left a few minor suggestions for typos and docstring updates. Overall, this is an excellent contribution that will be very beneficial for users working with large models.
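As a rough illustration of the pattern described above (not the PR's actual classes), here is a minimal Python sketch of a lazy tensor wrapper that uses `functools.partial` to defer loading and transformation until the value is materialized; all names here (`LazyTensorSketch`, `load_and_transpose`, `weights.npz`, `decoder/kernel`) are hypothetical.

```python
# Minimal sketch of the lazy-loading pattern (hypothetical names, not the PR's code).
import functools

import numpy as np


class LazyTensorSketch:
  """Defers loading/transforming a weight until materialize() is called."""

  def __init__(self, load_fn, shape, dtype):
    self._load_fn = load_fn  # zero-argument callable that returns the actual array
    self.shape = shape
    self.dtype = dtype

  def materialize(self) -> np.ndarray:
    # The expensive read + transform happens only here, one tensor at a time.
    return self._load_fn()


def load_and_transpose(path: str, key: str) -> np.ndarray:
  """Example load function: read a single array from an .npz file and transpose it."""
  with np.load(path) as archive:
    return archive[key].T


# functools.partial binds the arguments now but defers the actual work until
# the checkpoint saver calls materialize() on this tensor.
lazy_weight = LazyTensorSketch(
    load_fn=functools.partial(load_and_transpose, "weights.npz", "decoder/kernel"),
    shape=(4096, 4096),
    dtype=np.float32,
)
# array = lazy_weight.materialize()  # nothing is read from disk until this line runs
```

Because each load function touches disk only when called, the saver can materialize one tensor at a time and release it before moving on, which is what keeps peak RAM low.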
shuningjin left a comment
Thank you for adding this nice feature to significantly improve memory usage! If you have a time comparison, it would be great to include it in the PR description.
Force-pushed from 81eb54e to 10cbbb0
RissyRan left a comment
Thank you!
Force-pushed from 10cbbb0 to 14df4ae
Description
This PR introduces a lazy_load configuration option that reduces peak RAM usage by deferring the loading and transformation of weights until the exact moment Orbax needs to save them to disk.
Note: the lazy-loading feature is temporarily disabled for multimodal models because the key names in the checkpoint files differ from those seen when loading a HF model with AutoConfig.
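For context on why deferring the reads reduces peak RAM, here is a minimal sketch (assuming a safetensors-format HF checkpoint; the shard filename and the hand-off step are placeholders) that reads one tensor at a time with `safetensors.safe_open` rather than loading an entire shard into memory:

```python
# Sketch: read HF safetensors weights one tensor at a time so only a single
# tensor is resident in RAM at any moment (shard name below is a placeholder).
from safetensors import safe_open

shard_path = "model-00001-of-00030.safetensors"  # placeholder shard filename

with safe_open(shard_path, framework="np") as shard:
  for key in shard.keys():
    tensor = shard.get_tensor(key)  # loads just this tensor from disk
    # ... transform the tensor and hand it to the checkpoint saver here ...
    del tensor  # release it before reading the next key
```

PyTorch `.bin` shards are pickle archives and do not offer this kind of per-key access, which is presumably why the two formats need separate handling.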
Key Changes
If the change fixes a bug or a GitHub issue, please include a link, e.g.:
FIXES: b/458745828
Notice 1: Once all tests pass, the "pull ready" label will automatically be assigned.
This label is used for administrative purposes. Please do not add it manually.
Notice 2: For external contributions, our settings currently require an approval from a MaxText maintainer to trigger CI tests.
Tests
x here denotes the model's parameter count in billions; e.g. for the 70B model, 8.8x GB = 8.8 × 70 ≈ 616 GB.
Llama-3.1-70B
old RAM usage: 616 GB (8.8x GB)
new RAM usage: 86.2 GB (1.2x GB)
Logs: https://paste.googleplex.com/6288510168465408#l=571
Llama-3.1-8B
old RAM usage: 51 GB (6.3x GB)
new RAM usage: 31 GB (4x GB)
Logs: https://paste.googleplex.com/6128000672333824
forward pass logits test: https://paste.googleplex.com/5595374018494464
Qwen3-4B
old RAM usage: 37 GB (9.2x GB)
new RAM usage: 15.7 GB (3.7x GB)
Logs: https://paste.googleplex.com/4933249042350080#l=543
forward pass logits test: https://paste.googleplex.com/5515770054443008
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.