LTX Video 0.9.8 long multi prompt #12614
Conversation
hi @yaoqih thanks for the PR, can we try turning off the tiled decoding to see if the first image is still blurry?
I experimented with disabling chunked decoding. Here is the code I used:

from typing import Optional

import torch
from diffusers import LTXI2VLongMultiPromptPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.modeling_latent_upsampler import LTXLatentUpsamplerModel
from diffusers.pipelines.ltx.pipeline_ltx import LTXPipeline
from diffusers.utils import export_to_video
from PIL import Image
def decode_full_video(
pipeline: LTXI2VLongMultiPromptPipeline,
latents: torch.Tensor,
decode_timestep: Optional[float] = 0.05,
decode_noise_scale: Optional[float] = 0.025,
generator: Optional[torch.Generator] = None,
output_type: str = "pil",
last_frame_fix: bool = True,
auto_denormalize: bool = True,
compute_dtype: torch.dtype = torch.float32,
):
"""
Decode the full latent tensor in one pass, mirroring `vae_decode_tiled`'s color handling.
"""
if output_type == "latent":
return latents
device = pipeline._execution_device
latents = latents.to(device=device, dtype=compute_dtype)
# Match ComfyUI parity by de-normalizing the denoising latents before VAE decode
if auto_denormalize:
latents = LTXPipeline._denormalize_latents(
latents, pipeline.vae.latents_mean, pipeline.vae.latents_std, pipeline.vae.config.scaling_factor
)
latents = latents.to(dtype=pipeline.vae.dtype)
timestep = None
if getattr(pipeline.vae.config, "timestep_conditioning", False):
timestep_value = float(decode_timestep) if decode_timestep is not None else 0.0
timestep = torch.tensor([timestep_value], device=device, dtype=latents.dtype)
if decode_noise_scale is not None:
noise_scale = torch.tensor([float(decode_noise_scale)], device=device, dtype=latents.dtype)[
:, None, None, None, None
]
noise = torch.randn(latents.shape, generator=generator, device=device, dtype=latents.dtype)
latents = (1 - noise_scale) * latents + noise_scale * noise
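    # `last_frame_fix` duplicates the final latent frame before decoding (the extra
    # decoded frames are trimmed again below), presumably to avoid an artifact in
    # the last decoded frame.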
if last_frame_fix:
latents = torch.cat([latents, latents[:, :, -1:].contiguous()], dim=2)
video = pipeline.vae.decode(latents, timestep, return_dict=False)[0]
video = video.to(dtype=compute_dtype).clamp(-1.0, 1.0).to(dtype=pipeline.vae.dtype)
if last_frame_fix:
tsf = int(pipeline.vae_temporal_compression_ratio)
video = video[:, :, :-tsf, :, :]
if output_type in ("np", "pil"):
return pipeline.video_processor.postprocess_video(video.detach(), output_type=output_type)
return video
# Stage A: Long I2V with sliding windows and multi-prompt scheduling
pipe = LTXI2VLongMultiPromptPipeline.from_pretrained(
"LTX-Video-0.9.8-13B-distilled",
torch_dtype=torch.bfloat16
).to("cuda")
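# Multi-prompt schedule: the "|"-separated prompts are spread across the
# successive sliding temporal windows.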
schedule = "a chimpanzee walks in the jungle |a chimpanzee stops and eats a snack |a chimpanzee lays on the ground"
cond_image = Image.open("chimpanzee_l.jpg").convert("RGB")
# -- Base long video generation --
latents = pipe(
prompt=schedule,
negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
width=768,
height=512,
num_frames=361,
temporal_tile_size=120,
temporal_overlap=32,
sigmas=[1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250, 0.4219, 0.0],
guidance_scale=1.0,
cond_image=cond_image,
adain_factor=0.25,
output_type="latent",
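    # Keep only the first four latent frames here; decoding all 361 frames in one
    # pass with the untiled decoder runs out of memory even on a 96 GB GPU.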
).frames[:, :, :4, :, :]
# Decode without explicit tiling but keep memory optimizations enabled
video_pil_base = decode_full_video(
pipe,
latents,
decode_timestep=0.05,
decode_noise_scale=0.025,
output_type="pil",
)[0]
export_to_video(video_pil_base, "ltx_i2v_long_base_diffusers.mp4", fps=24)
print("Stage A: Base long video generated and saved.")ltx_i2v_long_base_diffusers.mp4I'm only decoding the latents for the first four frames because processing the entire set causes an Out of Memory (OOM) error, even on my GPU with 96GB of VRAM. As you can see, the first frame is still blurry. I suspect the issue doesn't lie with the decoding method itself. The primary reason for implementing chunked decoding was to reduce memory consumption. If the decoding process were the root cause of the blurriness, I would expect the entire video to be blurry, not just the initial frame. Today, I revisited the ComfyUI source code to debug and align my implementation. I discovered that if I disable the noise mixing that occurs between the model inference step and the scheduler (at line 1166 in my code), the blurriness in the first frame disappears. However, this introduces a new artifact: the brightness and contrast of the first frame become slightly higher than in the subsequent frames.
Here is the resulting video after disabling noise mixing:

ltx_i2v_long_base_diffusers2.mp4

However, I believe that disabling this feature is not the correct solution, as it is an intentional part of the implementation in ComfyUI, which can be found here: ComfyUI/comfy/samplers.py at master · comfyanonymous/ComfyUI (specifically at line 400).
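For context, the step I am referring to looks roughly like the sketch below. This is only an illustration of the idea under my reading of the code; the tensor names (cond_latents, cond_mask) and the exact blend are hypothetical, and the actual logic lives in the pipeline around line 1166 and in ComfyUI's samplers.py.

import torch

def renoise_conditioning(
    cond_latents: torch.Tensor,
    latents: torch.Tensor,
    cond_mask: torch.Tensor,
    sigma: float,
    generator: torch.Generator = None,
) -> torch.Tensor:
    # Hypothetical sketch: before each denoising step, re-noise the frozen
    # conditioning latents to the current noise level (sigma) so they match
    # the partially denoised latents of the current window, then write them
    # back only where the conditioning mask is active.
    noise = torch.randn(
        cond_latents.shape,
        generator=generator,
        device=cond_latents.device,
        dtype=cond_latents.dtype,
    )
    noised_cond = (1.0 - sigma) * cond_latents + sigma * noise
    return torch.where(cond_mask.bool(), noised_cond, latents)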
PR: Add LTXI2VLongMultiPromptPipeline (ComfyUI-parity long I2V with multi-prompt sliding windows)
What does this PR do?
Primary implementation and docs
Motivation and context
Key features and changes
Usage example
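A minimal sketch based on the Stage A test code in this PR; the local checkpoint path and parameter values mirror that test and are assumptions rather than recommended defaults (the custom sigma schedule and decode options are omitted for brevity).

import torch
from diffusers import LTXI2VLongMultiPromptPipeline
from diffusers.utils import export_to_video
from PIL import Image

pipe = LTXI2VLongMultiPromptPipeline.from_pretrained(
    "LTX-Video-0.9.8-13B-distilled", torch_dtype=torch.bfloat16
).to("cuda")

# "|"-separated prompts are scheduled over successive temporal windows.
schedule = (
    "a chimpanzee walks in the jungle |"
    "a chimpanzee stops and eats a snack |"
    "a chimpanzee lays on the ground"
)

frames = pipe(
    prompt=schedule,
    cond_image=Image.open("chimpanzee_l.jpg").convert("RGB"),
    width=768,
    height=512,
    num_frames=361,
    temporal_tile_size=120,
    temporal_overlap=32,
    guidance_scale=1.0,
    adain_factor=0.25,
).frames[0]
export_to_video(frames, "ltx_i2v_long.mp4", fps=24)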
Breaking changes
Docs updated
Before submitting
Test Case
This test aims to verify the visual parity of the LTXI2VLongMultiPromptPipeline output against the ComfyUI LTXVideo plugin when configured with identical parameters.
1. Test Setup
Input Image:

Core Parameter Alignment:
To ensure a fair comparison, the following key parameters were kept identical between the ComfyUI and Diffusers implementations. These are based on the ltxv-13b-i2v-long-multi-prompt.json workflow.
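Judging from the Stage A call shared in this PR, the shared settings presumably include: width 768, height 512, num_frames 361, temporal_tile_size 120, temporal_overlap 32, guidance_scale 1.0, adain_factor 0.25, decode_timestep 0.05, decode_noise_scale 0.025, and the distilled sigma schedule [1.0, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250, 0.4219, 0.0].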
2. Diffusers Implementation Code
3. Results Comparison
Stage A: Base Long Video Generation
ltxv-base_00001.1.mp4
ltx_i2v_long_base.mp4
Stage B: Upsampling & Refinement
ltxv-ic-lora_00008_compressed.webm
ltx_i2v_long_refined.mp4
4. Limitation