LTX Video 0.9.8 long multi prompt #12614

yaoqih · 2025-11-08T19:33:35Z

PR: Add LTXI2VLongMultiPromptPipeline (ComfyUI-parity long I2V with multi-prompt sliding windows)

What does this PR do?

Introduces a new pipeline LTXI2VLongMultiPromptPipeline providing long-duration image-to-video generation using temporal sliding windows with multi-prompt scheduling, first-frame hard conditioning, window-tail fusion, optional AdaIN normalization, and tiled VAE decoding.
Aligns behavior with ComfyUI looping sampler semantics while following Diffusers conventions (device/generator handling, progress bar, error handling, docstrings, and API).
Adds documentation and examples to the LTX-Video docs page to showcase long I2V workflows, seeding strategy, and multi-stage refinement.

Primary implementation and docs

New/updated implementation:
- Class: LTXI2VLongMultiPromptPipeline
Docs:
- Adds a “Long image-to-video, multi-prompt sliding windows (ComfyUI parity)” example: diffusers/docs/source/en/api/pipelines/ltx_video.md

Motivation and context

Long video generation is a common need for LTX-Video users. This pipeline provides a Diffusers-native implementation that mirrors ComfyUI behavior while maintaining Diffusers style and ergonomics.
The pipeline enables multi-prompt text scheduling across temporal windows, first-frame I2V hard conditioning via per-token mask, and smooth fusion across windows for consistent motion and content.

Key features and changes

Temporal sliding windows only (no spatial sharding in denoising); autoregressive fusion across windows.
Multi-prompt segmentation per window; transitions handled at the head of each window.
First-frame hard conditioning via per-token mask when cond_image is provided.
Reference and “negative index” latent injection at window head; optional AdaIN cross-window normalization for color/contrast consistency.
Per-window timesteps reset; ability to skip steps by sigma threshold for speed.

Usage example

Basic long I2V multi-prompt run:
- See diffusers/docs/source/en/api/pipelines/ltx_video.mdfor a complete example including:
  - Multi-prompt schedule string split by “|”
  - cond_image for first-frame hard conditioning
  - Sliding window configuration (temporal_tile_size, temporal_overlap)
  - Returning latent-space video and tiled decoding via VAE
  - Optional spatial latent upsampling followed by short refinement with a compact sigma schedule

Breaking changes

None. New pipeline and docs are additive.

Docs updated

Long I2V multi-prompt example and notes: diffusers/docs/source/en/api/pipelines/ltx_video.md

Before submitting

This PR improves docs and adds a new pipeline.
Read contributor guidelines & docstring formatting guidelines.
Updated documentation with examples and autodoc entries.
Added or planned tests as suggested above (can be part of follow-up PR if needed).

Test Case

This test aims to verify the visual parity of the LTXI2VLongMultiPromptPipeline output against the ComfyUI LTXVideo plugin when configured with identical parameters.

1. Test Setup

Input Image:
Core Parameter Alignment:
To ensure a fair comparison, the following key parameters were kept identical between the ComfyUI and Diffusers implementations. These are based on the ltxv-13b-i2v-long-multi-prompt.json workflow.

2. Diffusers Implementation Code

import torch
from diffusers import LTXI2VLongMultiPromptPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.modeling_latent_upsampler import LTXLatentUpsamplerModel
from diffusers.utils import export_to_video
from PIL import Image

# Stage A: Long I2V with sliding windows and multi-prompt scheduling
pipe = LTXI2VLongMultiPromptPipeline.from_pretrained(
    "Lightricks/LTX-Video-0.9.8-13B-distilled",
    torch_dtype=torch.bfloat16
).to("cuda")

schedule = "a chimpanzee walks in the jungle |a chimpanzee stops and eats a snack |a chimpanzee lays on the ground"
cond_image = Image.open("chimpanzee_l.jpg").convert("RGB")

# -- Base long video generation --
latents = pipe(
    prompt=schedule,
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
    width=768,
    height=512,
    num_frames=361,
    temporal_tile_size=120,
    temporal_overlap=32,
    sigmas=[1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250, 0.4219, 0.0],
    guidance_scale=1.0,
    cond_image=cond_image,
    adain_factor=0.25,
    output_type="latent",
).frames

# Decode with VAE tiling to save memory
video_pil_base = pipe.vae_decode_tiled(latents, decode_timestep=0.05, decode_noise_scale=0.025, output_type="pil")[0]
export_to_video(video_pil_base, "ltx_i2v_long_base_diffusers.mp4", fps=24)
print("Stage A: Base long video generated and saved.")

# Stage B (Optional): Spatial latent upsampling + short refinement
upsampler = LTXLatentUpsamplerModel.from_pretrained("linoyts/LTX-Video-spatial-upscaler-0.9.8/latent_upsampler", torch_dtype=torch.bfloat16)
pipe_upsample = LTXLatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=upsampler).to(torch.bfloat16).to("cuda")

up_latents = pipe_upsample(
    latents=latents,
    adain_factor=1.0,
    tone_map_compression_ratio=0.6,
    output_type="latent"
).frames

# -- Load LoRA and perform refinement --
try:
    pipe.load_lora_weights(
        "Lightricks/LTX-Video-ICLoRA-detailer-13b-0.9.8/ltxv-098-ic-lora-detailer-diffusers.safetensors",
        adapter_name="ic-detailer",
    )
    pipe.fuse_lora(components=["transformer"], lora_scale=1.0)
    print("[Info] IC-LoRA detailer adapter loaded and fused.")
  
    # Short refinement pass (distilled; low steps)
    frames_refined = pipe(
        negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
        width=768,        # width/height match latents and are not scaled
        height=512,
        num_frames=up_latents.shape[2],
        temporal_tile_size=80,
        temporal_overlap=24,
        adain_factor=0.0, # Disable AdaIN in refinement
        latents=up_latents,
        guidance_latents=up_latents,
        sigmas=[0.99, 0.9094, 0.0], # Short sigma schedule
        output_type="pil",
    ).frames[0]

    export_to_video(frames_refined, "ltx_i2v_long_refined_diffusers.mp4", fps=24)
    print("Stage B: Refined video generated and saved.")

except Exception as e:
    print(f"[Warn] Failed to load IC-LoRA or run refinement: {e}. Skipping the second refinement sampling.")

3. Results Comparison

Stage A: Base Long Video Generation

ComfyUI	Diffusers (This PR)
Original Video Link ltxv-base_00001.1.mp4	Original Video Link ltx_i2v_long_base.mp4

Stage B: Upsampling & Refinement

ComfyUI	Diffusers (This PR)
Original Video Link ltxv-ic-lora_00008_compressed.webm	Original Video Link ltx_i2v_long_refined.mp4

4. Limitation

First-Frame Blurriness: The initial frame, despite being hard-conditioned on the input image, may exhibit minor blurring or a slight loss of sharpness. This is a characteristic observed in the current model behavior.
Minor Parity Differences: While this pipeline achieves strong parity in motion, content, and overall style, variations may still exist when compared to the ComfyUI reference output.

yiyixuxu · 2025-11-11T02:41:43Z

hi @yaoqih thanks for the PR, can we try turn off the tiled decoding, to see if the first image is still blurry?

yaoqih · 2025-11-11T16:16:54Z

I experimented with disabling chunked decoding. Here is the code I used:

import torch
from diffusers import LTXI2VLongMultiPromptPipeline, LTXLatentUpsamplePipeline
from diffusers.pipelines.ltx.modeling_latent_upsampler import LTXLatentUpsamplerModel
from diffusers.pipelines.ltx.pipeline_ltx import LTXPipeline
from diffusers.utils import export_to_video
from PIL import Image


def decode_full_video(
    pipeline: LTXI2VLongMultiPromptPipeline,
    latents: torch.Tensor,
    decode_timestep: Optional[float] = 0.05,
    decode_noise_scale: Optional[float] = 0.025,
    generator: Optional[torch.Generator] = None,
    output_type: str = "pil",
    last_frame_fix: bool = True,
    auto_denormalize: bool = True,
    compute_dtype: torch.dtype = torch.float32,
):
    """
    Decode the full latent tensor in one pass, mirroring `vae_decode_tiled`'s color handling.
    """
    if output_type == "latent":
        return latents

    device = pipeline._execution_device
    latents = latents.to(device=device, dtype=compute_dtype)

    # Match ComfyUI parity by de-normalizing the denoising latents before VAE decode
    if auto_denormalize:
        latents = LTXPipeline._denormalize_latents(
            latents, pipeline.vae.latents_mean, pipeline.vae.latents_std, pipeline.vae.config.scaling_factor
        )
    latents = latents.to(dtype=pipeline.vae.dtype)

    timestep = None
    if getattr(pipeline.vae.config, "timestep_conditioning", False):
        timestep_value = float(decode_timestep) if decode_timestep is not None else 0.0
        timestep = torch.tensor([timestep_value], device=device, dtype=latents.dtype)

        if decode_noise_scale is not None:
            noise_scale = torch.tensor([float(decode_noise_scale)], device=device, dtype=latents.dtype)[
                :, None, None, None, None
            ]
            noise = torch.randn(latents.shape, generator=generator, device=device, dtype=latents.dtype)
            latents = (1 - noise_scale) * latents + noise_scale * noise

    if last_frame_fix:
        latents = torch.cat([latents, latents[:, :, -1:].contiguous()], dim=2)

    video = pipeline.vae.decode(latents, timestep, return_dict=False)[0]
    video = video.to(dtype=compute_dtype).clamp(-1.0, 1.0).to(dtype=pipeline.vae.dtype)
    if last_frame_fix:
        tsf = int(pipeline.vae_temporal_compression_ratio)
        video = video[:, :, :-tsf, :, :]

    if output_type in ("np", "pil"):
        return pipeline.video_processor.postprocess_video(video.detach(), output_type=output_type)
    return video

# Stage A: Long I2V with sliding windows and multi-prompt scheduling
pipe = LTXI2VLongMultiPromptPipeline.from_pretrained(
    "LTX-Video-0.9.8-13B-distilled",
    torch_dtype=torch.bfloat16
).to("cuda")
schedule = "a chimpanzee walks in the jungle |a chimpanzee stops and eats a snack |a chimpanzee lays on the ground"
cond_image = Image.open("chimpanzee_l.jpg").convert("RGB")

# -- Base long video generation --
latents = pipe(
    prompt=schedule,
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
    width=768,
    height=512,
    num_frames=361,
    temporal_tile_size=120,
    temporal_overlap=32,
    sigmas=[1.0000, 0.9937, 0.9875, 0.9812, 0.9750, 0.9094, 0.7250, 0.4219, 0.0],
    guidance_scale=1.0,
    cond_image=cond_image,
    adain_factor=0.25,
    output_type="latent",
).frames[:,:,:4,:,:]

# Decode without explicit tiling but keep memory optimizations enabled
video_pil_base = decode_full_video(
    pipe,
    latents,
    decode_timestep=0.05,
    decode_noise_scale=0.025,
    output_type="pil",
)[0]
export_to_video(video_pil_base, "ltx_i2v_long_base_diffusers.mp4", fps=24)
print("Stage A: Base long video generated and saved.")

ltx_i2v_long_base_diffusers.mp4

I'm only decoding the latents for the first four frames because processing the entire set causes an Out of Memory (OOM) error, even on my GPU with 96GB of VRAM. As you can see, the first frame is still blurry.

I suspect the issue doesn't lie with the decoding method itself. The primary reason for implementing chunked decoding was to reduce memory consumption. If the decoding process were the root cause of the blurriness, I would expect the entire video to be blurry, not just the initial frame.

Today, I revisited the ComfyUI source code to debug and align my implementation. I discovered that if I disable the noise mixing that occurs between the model inference step and the scheduler (at line 1166 in my code), the blurriness in the first frame disappears. However, this introduces a new artifact: the brightness and contrast of the first frame become slightly higher than in the subsequent frames.

Here is the resulting video after disabling noise mixing:

ltx_i2v_long_base_diffusers2.mp4

However, I believe that disabling this feature is not the correct solution, as it is an intentional part of the implementation in ComfyUI, which can be found here: ComfyUI/comfy/samplers.py at master · comfyanonymous/ComfyUI (specifically at line 400).

LTX Video 0.9.8 long multi prompt

4981971

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

LTX Video 0.9.8 long multi prompt #12614

LTX Video 0.9.8 long multi prompt #12614

Uh oh!

yaoqih commented Nov 8, 2025 •

edited

Loading

Uh oh!

yiyixuxu commented Nov 11, 2025

Uh oh!

yaoqih commented Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LTX Video 0.9.8 long multi prompt #12614

Are you sure you want to change the base?

LTX Video 0.9.8 long multi prompt #12614

Uh oh!

Conversation

yaoqih commented Nov 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Case

1. Test Setup

2. Diffusers Implementation Code

3. Results Comparison

Stage A: Base Long Video Generation

Stage B: Upsampling & Refinement

4. Limitation

Uh oh!

yiyixuxu commented Nov 11, 2025

Uh oh!

yaoqih commented Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yaoqih commented Nov 8, 2025 •

edited

Loading