[Video Generation] Text2Video Pipeline #2991

likholat · 2025-11-10T12:04:27Z

Description

LTX Text2Video

Continuation of #2982

CVS-164653

Checklist:

Tests have been updated or added to cover the new code.
This patch fully addresses the ticket.
I have made corresponding changes to the documentation.

.member requires C++20

Copilot

Pull Request Overview

This PR implements a Text2Video pipeline for the LTX-Video model, building upon previous work. The implementation includes text-to-video generation, model wrappers for the transformer and VAE components, configuration management, and sample applications demonstrating usage.

Key Changes:

Adds Text2VideoPipeline class and related video generation infrastructure
Implements LTX-Video specific models (transformer and VAE decoder)
Extends T5EncoderModel to support attention masks and tokenization parameters
Adds video similarity utility and sample applications

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
src/cpp/src/video_generation/text2video_pipeline.cpp	Core pipeline implementation with latent packing/unpacking and video post-processing
src/cpp/src/video_generation/ltx_video_transformer_3d_model.cpp	Wrapper for LTX video transformer model
src/cpp/src/video_generation/autoencoder_kl_ltx_video.cpp	VAE decoder implementation for video
src/cpp/src/image_generation/models/t5_encoder_model.cpp	Extended to support attention masks and custom tokenization parameters
src/cpp/include/openvino/genai/video_generation/generation_config.hpp	Video-specific generation configuration
samples/cpp/video_generation/text2video.cpp	Sample application demonstrating video generation
video_similarity.py	Utility for computing video similarity metrics

Comments suppressed due to low confidence (1)

src/cpp/src/video_generation/text2video_pipeline.cpp:1

String comparison operands are reversed. Should be if __name__ == "__main__":

// Copyright (C) 2025 Intel Corporation

src/cpp/src/video_generation/text2video_pipeline.cpp

src/cpp/include/openvino/genai/video_generation/generation_config.hpp

src/cpp/src/image_generation/models/t5_encoder_model.cpp

samples/python/video_generation/ltx-video.py

samples/python/video_generation/video_similarity.py

Copilot · 2025-11-18T08:03:42Z

src/cpp/src/image_generation/schedulers/flow_match_euler_discrete.cpp


 #include "image_generation/numpy_utils.hpp"
 #include "utils.hpp"
+#include "debug_utils.hpp"


[nitpick] Debug utility header included in production code. Consider removing this include if debug_utils.hpp is only intended for debugging purposes and not needed in production builds.

Suggested change

-#include "debug_utils.hpp"

Copilot · 2025-11-18T08:03:43Z

src/python/py_image_generation_models.cpp

+                    tokenization_params
+                );
+            },
            py::call_guard<py::gil_scoped_release>(), 


Redundant GIL release guard. The lambda already releases the GIL at line 193, making this call_guard unnecessary and potentially incorrect. Remove this line.

Suggested change

-            py::call_guard<py::gil_scoped_release>(),

likholat · 2025-11-18T17:43:26Z

samples/python/video_generation/ltx-video.py

@@ -0,0 +1,56 @@
+import argparse


This file will be removed in the final version

likholat · 2025-11-18T17:44:01Z

samples/python/video_generation/video_similarity.py

@@ -0,0 +1,78 @@
+# Based on https://huggingface.co/docs/transformers/main/model_doc/xclip#transformers.XCLIPModel.get_video_features


This script will be moved to wwb in the final version

Copilot

Pull Request Overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 14 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/image_generation/models/t5_encoder_model.cpp:1

The TODO suggests skipping attention mask filling when not needed by the pipeline. Implement a mechanism to conditionally compute the attention mask only when required to avoid unnecessary computational overhead.

// Copyright (C) 2023-2025 Intel Corporation

src/cpp/src/video_generation/text2video_pipeline.cpp

src/cpp/src/image_generation/models/t5_encoder_model.cpp

Copilot · 2025-11-19T06:47:29Z

samples/cpp/video_generation/text2video.cpp

+    // Compare with https://github.com/Lightricks/LTX-Video
+    // TODO: Test GPU, NPU, HETERO, MULTI, AUTO, different steps on different devices
+    // TODO: describe algo to generate a video in docs and docstrings
+    // TODO: explain in docstrings available perf metrics
+    // scheduler needs extra dim?
+    // Present that will update validation tools later
+    // new classes LTXVideoTransformer3DModel AutoencoderKLLTXVideo
+    // private copy constructors to implement clone()
+    // const VideoGenerationConfig& may outlive VideoGenerationConfig?
+    // Move negative_prompt to Property
+    // Allow selecting different models to export from optimum-intel, for example ltxv-2b-0.9.8-distilled.safetensors
+    // LoRA later: https://huggingface.co/Lightricks/LTX-Video-ICLoRA-depth-13b-0.9.7, https://huggingface.co/Lightricks/LTX-Video-ICLoRA-pose-13b-0.9.7, https://huggingface.co/Lightricks/LTXV-LoRAs Check https://github.com/Lightricks/LTX-Video for updates
+    // Wasn't need so far so not going to implement:
+    //     OVLTXPipeline allows prompt_embeds and prompt_attention_mask instead of prompt; Same for negative_prompt_embeds and negative_prompt_attention_mask
+    //     OVLTXPipeline allows batched generation with multiple prompts
+    // Tests:
+    //     Functional
+    //     Sample
+    // Cover all config members in sample. Use default values explicitly
+    // Prefer patching optimum-intel to include more stuff into a model instead of implementing it in C++
+    // Add video-to-video, inpainting
+    // image to video described in https://huggingface.co/Lightricks/LTX-Video (class LTXConditionPipeline)
+    // Optimum doesn't have LTXLatentUpsamplePipeline class
+    // Controlled video from https://github.com/Lightricks/LTX-Video
+    // TODO: decode, perf metrics, set_scheduler, set/get_generation_config, reshape, compile, clone()
+    // TODO: Rename image->video everywhere
+    // TODO: test multiple videos per prompt
+    // TODO: test with different config values
+    // TODO: test log prompts to check truncation
+    // TODO: throw if num_frames isn't devisable by 8 + 1. Similar value for resolution. The model works on resolutions that are divisible by 32 and number of frames that are divisible by 8 + 1 (e.g. 257). The model works best on resolutions under 720 x 1280 and number of frames below 257.
+    // OVLTXPipeline()(num_inference_steps=1) fails. 2 passes. Would be nice to avoid that bug in genai.
+    // Verify tiny resolution like 32x32
+    const std::string device = "CPU";  // GPU can be used as well


Extensive TODO comments in the sample file should be addressed or moved to a proper issue tracker. Sample files should demonstrate clean, production-ready usage patterns without extensive inline TODO lists.

Suggested change

-    // Compare with https://github.com/Lightricks/LTX-Video
-    // TODO: Test GPU, NPU, HETERO, MULTI, AUTO, different steps on different devices
-    // TODO: describe algo to generate a video in docs and docstrings
-    // TODO: explain in docstrings available perf metrics
-    // scheduler needs extra dim?
-    // Present that will update validation tools later
-    // new classes LTXVideoTransformer3DModel AutoencoderKLLTXVideo
-    // private copy constructors to implement clone()
-    // const VideoGenerationConfig& may outlive VideoGenerationConfig?
-    // Move negative_prompt to Property
-    // Allow selecting different models to export from optimum-intel, for example ltxv-2b-0.9.8-distilled.safetensors
-    // LoRA later: https://huggingface.co/Lightricks/LTX-Video-ICLoRA-depth-13b-0.9.7, https://huggingface.co/Lightricks/LTX-Video-ICLoRA-pose-13b-0.9.7, https://huggingface.co/Lightricks/LTXV-LoRAs Check https://github.com/Lightricks/LTX-Video for updates
-    // Wasn't need so far so not going to implement:
-    //     OVLTXPipeline allows prompt_embeds and prompt_attention_mask instead of prompt; Same for negative_prompt_embeds and negative_prompt_attention_mask
-    //     OVLTXPipeline allows batched generation with multiple prompts
-    // Tests:
-    //     Functional
-    //     Sample
-    // Cover all config members in sample. Use default values explicitly
-    // Prefer patching optimum-intel to include more stuff into a model instead of implementing it in C++
-    // Add video-to-video, inpainting
-    // image to video described in https://huggingface.co/Lightricks/LTX-Video (class LTXConditionPipeline)
-    // Optimum doesn't have LTXLatentUpsamplePipeline class
-    // Controlled video from https://github.com/Lightricks/LTX-Video
-    // TODO: decode, perf metrics, set_scheduler, set/get_generation_config, reshape, compile, clone()
-    // TODO: Rename image->video everywhere
-    // TODO: test multiple videos per prompt
-    // TODO: test with different config values
-    // TODO: test log prompts to check truncation
-    // TODO: throw if num_frames isn't devisable by 8 + 1. Similar value for resolution. The model works on resolutions that are divisible by 32 and number of frames that are divisible by 8 + 1 (e.g. 257). The model works best on resolutions under 720 x 1280 and number of frames below 257.
-    // OVLTXPipeline()(num_inference_steps=1) fails. 2 passes. Would be nice to avoid that bug in genai.
-    // Verify tiny resolution like 32x32
-    const std::string device = "CPU";  // GPU can be used as well
+    // Set device to CPU; GPU can be used as well
+    const std::string device = "CPU";
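The frame-count and resolution constraints mentioned in the TODO list above could be checked with something like the following sketch. The helper names `is_valid_num_frames` and `is_valid_resolution` are hypothetical and do not come from the PR. "Divisible by 8 + 1" is read here as `num_frames == 8 * k + 1`, which matches the quoted example of 257; treating a single frame (`k == 0`) as valid is an assumption.

```cpp
#include <cstddef>

// Hypothetical helper: "divisible by 8 + 1" is interpreted as num_frames == 8 * k + 1,
// e.g. 257 == 8 * 32 + 1. Accepting k == 0 (a single frame) is an assumption.
bool is_valid_num_frames(std::size_t num_frames) {
    return num_frames >= 1 && (num_frames - 1) % 8 == 0;
}

// Hypothetical helper: both dimensions must be positive multiples of 32.
bool is_valid_resolution(std::size_t height, std::size_t width) {
    return height > 0 && width > 0 && height % 32 == 0 && width % 32 == 0;
}
```

In the pipeline itself such predicates would presumably feed an `OPENVINO_ASSERT` with a descriptive message rather than being returned to the caller.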

samples/cpp/video_generation/text2video.cpp

Copilot · 2025-11-19T06:47:30Z

src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp

+        size_t patch_size = 4;  // TODO: read from vae_decoder/config.json
+        std::vector<bool> spatio_temporal_scaling{true, true, true, false};  // TODO: read from vae_decoder/config.json. I use it only to compute sum over it so far, so it may be removed
+        size_t patch_size_t = 1;  // TODO: read from vae_decoder/config.json


Configuration parameters have TODO comments indicating they should be read from config.json. These should be properly loaded from the configuration file rather than using hardcoded defaults.

Copilot · 2025-11-19T06:47:30Z

src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp

+        // latents_mean = torch.zeros((latent_channels,), requires_grad=False)
+        // latents_std = torch.ones((latent_channels,), requires_grad=False)
+        std::vector<float> latents_mean_data; // TODO: set default value
+        std::vector<float> latents_std_data;  // TODO: set default value
+


Configuration parameters latents_mean_data and latents_std_data have TODO comments about default values. These should either have proper defaults set or be marked as required configuration.

Suggested change

-        // latents_mean = torch.zeros((latent_channels,), requires_grad=False)
-        // latents_std = torch.ones((latent_channels,), requires_grad=False)
-        std::vector<float> latents_mean_data; // TODO: set default value
-        std::vector<float> latents_std_data;  // TODO: set default value
+        std::vector<float> latents_mean_data;
+        std::vector<float> latents_std_data;
+        Config() :
+            latents_mean_data(latent_channels, 0.0f),
+            latents_std_data(latent_channels, 1.0f)
+        {}
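One detail worth keeping in mind with defaults of this kind is that C++ initializes members in declaration order, so `latent_channels` has to be declared before the vectors that depend on it. A minimal sketch using default member initializers instead of a constructor, assuming a placeholder `latent_channels` value of 4 (the real value is not shown in the quoted snippet):

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for the VAE config; only the members under discussion are kept.
struct Config {
    // Placeholder value for illustration only; the real one would come from the VAE config.
    std::size_t latent_channels = 4;
    // Declared after latent_channels: members are initialized in declaration order,
    // so latent_channels already holds its value when these initializers run.
    std::vector<float> latents_mean_data = std::vector<float>(latent_channels, 0.0f);
    std::vector<float> latents_std_data = std::vector<float>(latent_channels, 1.0f);
};
```

If `latent_channels` is later overwritten from a config file, the vectors would need to be resized accordingly; the sketch does not cover that step.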

Copilot · 2025-11-19T06:47:30Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+        //TODO: move to compute_hidden_states
+        ov::Tensor rope_interpolation_scale(ov::element::f32, {3});


The TODO suggests moving rope_interpolation_scale computation to compute_hidden_states. This would improve code organization by grouping related hidden state computations together.

Wovchena · 2025-11-24T12:58:34Z

@sgonorov, please review

sgonorov

Looks good, but it would be better to rebase it onto the latest master.

sgonorov · 2025-11-26T11:08:09Z

src/cpp/src/video_generation/ltx_video_transformer_3d_model.cpp

+    auto filtered_properties = extract_adapters_from_properties(properties, &adapters);
+    OPENVINO_ASSERT(!adapters, "Adapters are not currently supported for Video Generation Pipeline.");
+    ov::CompiledModel compiled_model = utils::singleton_core().compile_model(m_model, device, *filtered_properties);
+    ov::genai::utils::print_compiled_model_properties(compiled_model, "Flux Transformer 2D model");


Suggested change

ov::genai::utils::print_compiled_model_properties(compiled_model, "Flux Transformer 2D model");

ov::genai::utils::print_compiled_model_properties(compiled_model, "LTX Video Transformer 3D model");

sgonorov · 2025-11-26T11:14:33Z

src/cpp/src/video_generation/generation_config.cpp

+    ImageGenerationConfig::validate();
+}
+
+void VideoGenerationConfig::update_generation_config(const ov::AnyMap& properties) {


We should also load num_videos_per_prompt here.

sgonorov · 2025-11-26T11:17:33Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+
+using namespace ov::genai;
+
+namespace {


Maybe it's better to put these utilities into a separate file? text2video_pipeline.cpp looks quite huge already.

sgonorov · 2025-11-26T11:19:21Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+    std::shared_ptr<LTXVideoTransformer3DModel> m_transformer;
+    std::shared_ptr<AutoencoderKLLTXVideo> m_vae;
+    VideoGenerationPerfMetrics m_perf_metrics;
+    double m_latent_timestep = -1.0;  // TODO: float?


I think it should be removed: I don't see any usages.

sgonorov · 2025-11-26T11:27:46Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+                const float* noise_pred_text = noise_pred_uncond + noisy_residual_tensor.get_size();
+
+                for (size_t i = 0; i < noisy_residual_tensor.get_size(); ++i) {
+                    noisy_residual[i] = noise_pred_uncond[i] +


We can use std::fma here for clarity.
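A minimal sketch of what the `std::fma` version might look like. The quoted loop body is truncated, so this assumes the usual classifier-free guidance blend `uncond + guidance_scale * (text - uncond)`; the function name `apply_guidance` and its parameters are hypothetical and not taken from the PR.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from the PR. Assumes the unconditional and text-conditioned
// predictions are stored back to back in `noise_pred`, as the quoted pointer arithmetic suggests.
std::vector<float> apply_guidance(const std::vector<float>& noise_pred, float guidance_scale) {
    const std::size_t half = noise_pred.size() / 2;
    const float* noise_pred_uncond = noise_pred.data();
    const float* noise_pred_text = noise_pred_uncond + half;

    std::vector<float> noisy_residual(half);
    for (std::size_t i = 0; i < half; ++i) {
        // std::fma(a, b, c) computes a * b + c with a single rounding step.
        noisy_residual[i] = std::fma(guidance_scale,
                                     noise_pred_text[i] - noise_pred_uncond[i],
                                     noise_pred_uncond[i]);
    }
    return noisy_residual;
}
```

Because `std::fma` rounds once rather than twice, results may differ in the last bit from the plain `a * b + c` expression, which could matter when comparing outputs against a reference implementation.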

sgonorov · 2025-11-26T12:08:22Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+    double m_latent_timestep = -1.0;  // TODO: float?
+    Ms m_load_time;
+
+    size_t m_latent_num_frames = -1;


Are we sure we want to assign -1 to size_t here? Maybe it's better to replace it with std::optional?
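For reference, a minimal sketch of the `std::optional` alternative. `LatentShapeState` and `get_latent_num_frames` are hypothetical names standing in for the pipeline's private members; only the field under discussion is modelled.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>

// Hypothetical stand-in for the pipeline's private state.
struct LatentShapeState {
    // Empty until the latent shape has been computed. Assigning -1 to a size_t
    // would instead wrap around to the largest representable value.
    std::optional<std::size_t> latent_num_frames;

    // Returns the stored value, or throws if it was never set.
    std::size_t get_latent_num_frames() const {
        if (!latent_num_frames.has_value()) {
            throw std::logic_error("latent_num_frames has not been computed yet");
        }
        return *latent_num_frames;
    }
};
```

This requires C++17. The "not set" state becomes explicit instead of relying on a sentinel that is also a valid `size_t` value.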

sgonorov · 2025-11-26T12:14:34Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+        std::function<bool(size_t, size_t, ov::Tensor&)> callback;
+        auto callback_iter = properties.find(ov::genai::callback.name());
+        if (callback_iter != properties.end()) {
+            callback = callback_iter->second.as<std::function<bool(size_t, size_t, ov::Tensor&)>>();


I may be wrong, but I don't see any callback usages further down in the code. Should we call it somewhere?
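One way the callback could be wired into the denoising loop is sketched below. `Tensor` is an empty stand-in for `ov::Tensor` so that the example compiles without OpenVINO, `run_denoising` is a hypothetical function, and the convention that returning `true` requests early termination is an assumption here rather than something stated in the PR.

```cpp
#include <cstddef>
#include <functional>

// Stand-in for ov::Tensor so the sketch compiles without OpenVINO.
struct Tensor {};

// Hypothetical loop skeleton. Assumption: a callback returning true asks the pipeline to stop.
// Returns the number of steps that were actually executed.
std::size_t run_denoising(std::size_t num_steps,
                          const std::function<bool(std::size_t, std::size_t, Tensor&)>& callback) {
    Tensor latent;
    std::size_t executed = 0;
    for (std::size_t step = 0; step < num_steps; ++step) {
        // ... transformer inference and the scheduler step would go here ...
        ++executed;
        // An empty std::function is skipped, so the callback stays optional.
        if (callback && callback(step, num_steps, latent)) {
            break;  // the caller asked to stop
        }
    }
    return executed;
}
```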

sgonorov · 2025-11-26T12:16:19Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+        config.max_sequence_length = LTX_VIDEO_DEFAULT_CONFIG.max_sequence_length;
+    }
+    if (std::isnan(config.guidance_rescale)) {
+        config.guidance_rescale = LTX_VIDEO_DEFAULT_CONFIG.guidance_rescale;


Is this parameter used? Maybe it's better to remove it for now?

sgonorov · 2025-11-26T12:18:31Z

src/cpp/src/video_generation/text2video_pipeline.cpp

+    OPENVINO_ASSERT(generation_config.height % 32 == 0, "Height have to be divisible by 32 but got ", generation_config.height);
+    OPENVINO_ASSERT(generation_config.width > 0, "Width must be positive");
+    OPENVINO_ASSERT(generation_config.width % 32 == 0, "Width have to be divisible by 32 but got ", generation_config.width);
+    OPENVINO_ASSERT(1.0f == generation_config.strength, "Strength isn't applicable. Must be set to the default 1.0");


This should be compared using a tolerance (delta) instead of ==, which may fail for floating-point values.
Also, this check is repeated on line 89.
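A tolerance-based comparison could look like the sketch below. `almost_equal` is a hypothetical helper, not part of the PR, and the default tolerance is an arbitrary illustrative value.

```cpp
#include <cmath>

// Hypothetical helper: true when `a` and `b` differ by at most `delta`.
// The default tolerance is illustrative and would need tuning for real use.
bool almost_equal(float a, float b, float delta = 1e-6f) {
    return std::fabs(a - b) <= delta;
}
```

The assertion on `strength` could then, for example, take the form `OPENVINO_ASSERT(almost_equal(generation_config.strength, 1.0f), ...)`.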

sgonorov · 2025-11-26T12:19:57Z

samples/cpp/video_generation/imwrite_video.cpp

+        // rcFrame (4 * 16-bit) -> left, top, right, bottom
+        writer.write_u16(0); // left
+        writer.write_u16(0); // top
+        writer.write_u16(static_cast<uint16_t>(w)); // right


We can add a bounds check here just in case.
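Such a check might be sketched as follows, assuming `w` is a `size_t` as the `static_cast` in the snippet suggests. `checked_u16` is a hypothetical helper name, not something defined in the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper: converts a frame dimension to uint16_t and throws
// instead of silently truncating values that do not fit into 16 bits.
std::uint16_t checked_u16(std::size_t value) {
    if (value > std::numeric_limits<std::uint16_t>::max()) {
        throw std::out_of_range("value does not fit into 16 bits");
    }
    return static_cast<std::uint16_t>(value);
}
```

With a helper like this, the write would become `writer.write_u16(checked_u16(w));`.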

Wovchena and others added 30 commits November 7, 2025 00:31

Add text2video

ca46149

fix compilation

bd2147c

Add configs

4cac671

Infer text encoder

03a3214

timesteps

523e1b7

style

507243a

floats

ce2fc1a

floats

01860a9

OPENVINO_SUPPRESS_DEPRECATED_START

382c95a

brackets

c3f0f23

print

bfe8b75

transformer info

5e45023

add header

9333f2c

VideoGenerationConfig

5bc47e2

rm namepsace

c0955de

video

9877050

notes

fb7779c

update todos

7979274

add ltx-video.py

ff5142d

Add LTXPipeline sample

4bfcb56

print

9210546

fix compilation

760fb1a

comma

ebc2af7

ltx_pipeline_output

ee5024d

namepsace

39dc772

ms

1734154

apply todo

72c218e

fix

3641733

promp attention mask supported

167a115

bad transformer output

94a5025

Wovchena and others added 4 commits November 7, 2025 00:57

Fix Windows compilation

f508230

.member requires C++20

text2video header update

7e10c99

align optimum and genai outputs

fe96548

fixes after rebase

e9a9525

likholat requested a review from Wovchena November 10, 2025 12:04

likholat self-assigned this Nov 10, 2025

likholat mentioned this pull request Nov 10, 2025

Text2Video pipeline #2982

Closed

3 tasks

likholat requested a review from rkazants November 10, 2025 12:08

Wovchena requested a review from Copilot November 18, 2025 08:01

Copilot AI reviewed Nov 18, 2025

Wovchena requested a review from sgonorov November 18, 2025 13:21

likholat added 2 commits November 18, 2025 18:10

decode method

72e2955

Merge remote-tracking branch 'origin/master' into new_ltx_text2video

5d8b19f

likholat commented Nov 18, 2025

codestyle fixes

fc19f79

likholat marked this pull request as ready for review November 18, 2025 17:51

Wovchena requested a review from Copilot November 19, 2025 06:43

Copilot AI reviewed Nov 19, 2025

likholat added 3 commits November 21, 2025 20:41

rm debug methods

0914ece

codestyle fixes

23d5c36

codestyle fixes

c04ee06

sgonorov requested changes Nov 26, 2025
