Implement temporal rolling VAE (Major VRAM reductions in Hunyuan and Kandinsky) #10995

rattus128 · 2025-11-29T23:19:56Z

Instead of running the temporal causal 3D convolutions over the full tensor, run them at most 2 latent frames at a time (which can correspond to more real frames). This reduces the VAE's VRAM usage in the temporal dimension to a constant. For videos with any substantial number of frames this is a major reduction in VRAM usage.
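As a rough illustration of the idea (not the code in this PR), here is a minimal pure-Python sketch of a causal temporal convolution computed in chunks with a carry. Each frame is reduced to a single number for brevity; the function names, the replicate-style padding of the first frame, and the chunk parameter are assumptions made for the sketch.

```python
def causal_conv_full(frames, kernel):
    """Causal temporal convolution over the whole sequence at once."""
    k = len(kernel)
    # Left-pad with copies of the first frame so output[t] only sees frames <= t.
    # (Replicate padding is an assumption of this sketch.)
    padded = [frames[0]] * (k - 1) + list(frames)
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(frames))]


def causal_conv_rolling(frames, kernel, chunk=2):
    """Same convolution, processed `chunk` frames at a time with a carry."""
    k = len(kernel)
    # The initial carry plays the role of the causal padding.
    carry = [frames[0]] * (k - 1)
    out = []
    for start in range(0, len(frames), chunk):
        window = carry + list(frames[start:start + chunk])
        n_out = len(window) - (k - 1)
        out.extend(sum(w * window[t + i] for i, w in enumerate(kernel))
                   for t in range(n_out))
        # Keep only the last k-1 frames as context for the next chunk.
        carry = window[len(window) - (k - 1):]
    return out


frames = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 2, 3]
print(causal_conv_rolling(frames, kernel) == causal_conv_full(frames, kernel))
```

In this sketch the rolling version never holds more than chunk + len(kernel) - 1 frames at once, however long the video is, while the full version holds all of them.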

Improves at least Hunyuan 1.0 and Kandinsky VAEs.

Regression tested with SDXL (shares 2d code).

All of these 480Px81f VAE ops used to tile, and now they fit comfortably (RTX5090):

Primary commits:

Author: Rattus <rattus128@gmail.com>
Date:   Sun Nov 30 08:56:07 2025 +1000

    model: Add temporal roll to main VAE encoder
    
    If there are no attention layers, it's a standard resnet and VideoConv3d
    is asked for, substitute in the temporal rolling VAE algorithm. This
    reduces VAE usage by the temporal dimension (can be huge VRAM savings).

commit 6571c912a70a6e9233283467f85b7ee10b3d7b59
Author: Rattus <rattus128@gmail.com>
Date:   Sun Nov 30 08:56:07 2025 +1000

    model: Add temporal roll to main VAE decoder
    
    If there are no attention layers, it's a standard resnet and VideoConv3d
    is asked for, substitute in the temporal rolling VAE algorithm. This
    reduces VAE usage by the temporal dimension (can be huge VRAM savings).


According to git grep, give_pre_end is not used now, and was not used in the initial commit that introduced it (see below). Supporting this semantic in the temporal rolling VAE would be difficult (and would defeat the purpose). Rather than implement the complex if, just delete the unused feature.

    (venv) rattus@rattus-box2:~/ComfyUI$ git log --oneline
    220afe3 (HEAD) Initial commit.
    (venv) rattus@rattus-box2:~/ComfyUI$ git grep give_pre
    comfy/ldm/modules/diffusionmodules/model.py: resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
    comfy/ldm/modules/diffusionmodules/model.py: self.give_pre_end = give_pre_end
    comfy/ldm/modules/diffusionmodules/model.py: if self.give_pre_end:
    (venv) rattus@rattus-box2:~/ComfyUI$ git co origin/master
    Previous HEAD position was 220afe3 Initial commit.
    HEAD is now at 9d8a817 Enable async offloading by default on Nvidia. (comfyanonymous#10953)
    (venv) rattus@rattus-box2:~/ComfyUI$ git grep give_pre
    comfy/ldm/modules/diffusionmodules/model.py: resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
    comfy/ldm/modules/diffusionmodules/model.py: self.give_pre_end = give_pre_end
    comfy/ldm/modules/diffusionmodules/model.py: if self.give_pre_end:


rattus128 added 5 commits November 30, 2025 09:04

hunyuan upsampler: rework imports

43c83e5

Remove the transitive import of VideoConv3d and Resnet and take these from the actual implementation source.

move refiner VAE temporal roller to core

119fc04

Move the carrying conv op to the common VAE code and give it a better name. Roll the carry implementation logic for Resnet into the base class and scrap the Hunyuan-specific subclass.
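To make the "carry" idea concrete, here is a toy sketch of a residual block whose convolutions each remember their own trailing frames between calls, so the block can be fed one chunk at a time. CarryConv, ResBlock and run are hypothetical names for this illustration, frames are single numbers, and normalization layers are left out; this is not the class layout used in the PR.

```python
class CarryConv:
    """Causal temporal conv that remembers trailing frames between calls."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.carry = None  # filled in when the first chunk arrives

    def __call__(self, chunk):
        k = len(self.kernel)
        if self.carry is None:
            # First chunk: replicate its first frame as causal padding
            # (an assumption of this sketch).
            self.carry = [chunk[0]] * (k - 1)
        window = self.carry + list(chunk)
        # Save the last k-1 input frames for the next call.
        self.carry = window[len(window) - (k - 1):]
        return [sum(w * window[t + i] for i, w in enumerate(self.kernel))
                for t in range(len(chunk))]


class ResBlock:
    """Toy residual block: out = x + conv2(relu(conv1(x))), chunk by chunk."""

    def __init__(self, k1, k2):
        self.conv1 = CarryConv(k1)
        self.conv2 = CarryConv(k2)

    def __call__(self, chunk):
        hidden = [max(0, v) for v in self.conv1(chunk)]
        return [x + y for x, y in zip(chunk, self.conv2(hidden))]


def run(frames, chunk):
    """Push `frames` through a fresh block, `chunk` frames per call."""
    block = ResBlock([1, -1, 2], [0, 1, 1])
    out = []
    for start in range(0, len(frames), chunk):
        out.extend(block(frames[start:start + chunk]))
    return out


frames = [3, -1, 4, 1, -5, 9, 2]
print(run(frames, 2) == run(frames, len(frames)))
```

Because each convolution keeps its own carry, the chunked run and the single full-length run produce the same output in this sketch.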

model: Add temporal roll to main VAE decoder

6571c91

If there are no attention layers, it's a standard resnet and VideoConv3d is asked for, substitute in the temporal rolling VAE algorithm. This reduces VAE usage by the temporal dimension (can be huge VRAM savings).

model: Add temporal roll to main VAE encoder

1d53c1f

If there are no attention layers, it's a standard resnet and VideoConv3d is asked for, substitute in the temporal rolling VAE algorithm. This reduces VAE usage by the temporal dimension (can be huge VRAM savings).
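The three conditions in the encoder and decoder commit messages can be read as a simple gate. The sketch below spells that reading out; VaeConfig and its field names are hypothetical and only stand in for however the real model code inspects its configuration.

```python
from dataclasses import dataclass


@dataclass
class VaeConfig:
    """Hypothetical summary of a VAE encoder/decoder configuration."""
    has_attention: bool    # any attention layers in the stack
    standard_resnet: bool  # uses the stock resnet block, not a custom one
    conv_op: str           # name of the requested convolution op


def use_temporal_roll(cfg: VaeConfig) -> bool:
    """True only when all three conditions from the commit message hold."""
    return (not cfg.has_attention
            and cfg.standard_resnet
            and cfg.conv_op == "VideoConv3d")


print(use_temporal_roll(VaeConfig(False, True, "VideoConv3d")))
```

Any configuration that fails one of the checks would keep the existing non-rolling path in this reading.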

rattus128 requested a review from Kosinkadink as a code owner November 29, 2025 23:19

yoland68 added the Core Core team dependency label Dec 2, 2025

comfyanonymous merged commit 73f5649 into comfyanonymous:master Dec 3, 2025
12 checks passed
