
Conversation

@yaogang2060 (Contributor)

What does this PR do?

Allow the qwen3vl video processor to handle these two cases:

  1. num_frames is 1 and do_sample_frames is False.
  2. num_frames > temporal_patch_size and num_frames % temporal_patch_size != 1.

Fixes:
QwenLM/Qwen3-VL#1689
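
A minimal reproduction sketch of the two cases; the checkpoint name, the use of AutoVideoProcessor, and setting temporal_patch_size on the instance are illustrative assumptions, not taken from this PR:

```python
# Hypothetical repro sketch (checkpoint name is illustrative; AutoVideoProcessor
# is assumed to resolve to the Qwen3-VL video processor).
import numpy as np
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

# Case 1: a "video" consisting of a single frame, with sampling disabled.
one_frame = np.random.randint(0, 256, (1, 64, 64, 3), dtype=np.uint8)
out = processor(videos=[one_frame], do_sample_frames=False)
print(out["video_grid_thw"])

# Case 2: a frame count whose remainder modulo temporal_patch_size is
# neither 0 nor 1 (needs temporal_patch_size > 2, e.g. 3 with 5 frames).
processor.temporal_patch_size = 3  # assumed override for illustration
five_frames = np.random.randint(0, 256, (5, 64, 64, 3), dtype=np.uint8)
out = processor(videos=[five_frames], do_sample_frames=False)
print(out["video_grid_thw"])
```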

@yaogang2060 (Contributor, Author)

@zucchini-nlp please check this~

@zucchini-nlp (Member) left a comment


Thanks a lot for the PR! Can you also update the Qwen2VL video processor with the same changes?

Comment on lines 199 to 204:

```python
T = stacked_videos.shape[1]
if pad := -T % temporal_patch_size:
    repeats = stacked_videos[:, -1:].expand(-1, pad, -1, -1, -1)
    stacked_videos = torch.cat((stacked_videos, repeats), dim=1)
B, T, C, H, W = stacked_videos.shape
num_frames, height, width = T, H, W
```
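
For reference, `-T % temporal_patch_size` is the smallest non-negative pad that makes T a multiple of temporal_patch_size (and 0, hence falsy, when T already is), which a few lines of plain Python confirm:

```python
# The padding arithmetic from the snippet above, in isolation:
# -T % p is the smallest pad >= 0 such that (T + pad) % p == 0.
p = 3  # temporal_patch_size
for T in range(1, 7):
    pad = -T % p
    assert (T + pad) % p == 0
    print(T, pad)  # 1->2, 2->1, 3->0, 4->2, 5->1, 6->0
```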
Member

I don't think this is needed if we are expanding it later, just before patchifying.

Contributor Author

If resize is enabled and num_frames < temporal_patch_size, the resize check throws an error on this line: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_vl/video_processing_qwen3_vl.py#L44.
So I think this is needed for cases such as a video that has only one frame.

Member

Ah, interesting. I remember we fixed a similar issue with the qwen2-vl series in #38706.

IMO we should not restrict the inputs to a certain size before resizing; instead we can adjust t_bar so it is never zero.

Contributor

Agree. How about moving this logic to sample_frames instead of _preprocess? If users sample frames themselves, they’d be responsible for ensuring at least temporal_patch_size frames—though that’s rarely an issue since videos typically have more than two frames. By the way, when would a “video” have only one frame? Wouldn’t that just be an image? 🤔

Contributor Author

> Agree. How about moving this logic to sample_frames instead of _preprocess? If users sample frames themselves, they’d be responsible for ensuring at least temporal_patch_size frames—though that’s rarely an issue since videos typically have more than two frames. By the way, when would a “video” have only one frame? Wouldn’t that just be an image? 🤔

My use case: I have videos from 7 camera views, and each video can have a different number of frames; some views may have only one frame. I want to use the video processor to preprocess these videos, which is how I ran into this...

Contributor Author

OK, I removed the temporal padding for the num_frames < temporal_patch_size case to keep this PR simple. 🫡

Comment on lines 332 to 348:

```python
def test_image_input(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
        video_processor_dict["do_sample_frames"] = False
        video_processor_dict["temporal_patch_size"] = 3
        video_processing = video_processing_class(**video_processor_dict)

        n, w, h = 1, 64, 64
        video_inputs = [np.random.randint(0, 256, (h, w, 3), dtype=np.uint8) for _ in range(n)]

        video_processed = video_processing(video_inputs, return_tensors="pt")
        encoded_videos = video_processed[self.input_name]
        self.assertEqual(list(encoded_videos.shape), [16, 2304])

        video_grid_thw = video_processed["video_grid_thw"]
        self.assertEqual(video_grid_thw.tolist(), [[1, 4, 4]])
```
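
For reference, the expected numbers follow from the patch geometry; patch_size=16 and 3 input channels are assumed defaults here, not stated in the PR:

```python
# Sanity check of the expected test shapes; patch_size=16 and RGB input
# are assumptions, not taken from the PR itself.
temporal_patch_size, patch_size, channels = 3, 16, 3
h = w = 64
grid_t = 1                                    # 1 frame, padded to 3, / temporal_patch_size
grid_h, grid_w = h // patch_size, w // patch_size           # 4, 4
num_patches = grid_t * grid_h * grid_w                      # 16
patch_dim = channels * temporal_patch_size * patch_size**2  # 3 * 3 * 256 = 2304
print([num_patches, patch_dim], [[grid_t, grid_h, grid_w]])  # [16, 2304] [[1, 4, 4]]
```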
Member

Kind of the same test as test_videos_PIL, I guess, so it is redundant. I think the one below for temporal patch size is enough.

Contributor Author

This test case covers a video with exactly one frame; test_videos_pil does not. I updated the test function name.

@zucchini-nlp (Member), Nov 10, 2025

I might be missing something: why do we need a test for a single frame? The issue solved by this PR is only about temporal padding.

Never mind, I just saw the comment above.

@yaogang2060 changed the title from "fix qwen3vl video processor temporal padding frames" to "fix qwen2vl/qwen3vl video processor temporal padding when num_frames%temporal_patch_size!=1" on Nov 10, 2025.
@zucchini-nlp (Member) left a comment

LGTM now! Do you plan to update smart_resize in this PR or make a separate one?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yaogang2060 (Contributor, Author)

> LGTM now! Do you plan to update smart_resize in this PR or make a separate one?

@JJJYmmm it's up to you 🫡. Maybe just use the smart_resize from qwen2vl?

@JJJYmmm (Contributor) commented Nov 10, 2025

@yaogang2060 As mentioned in #42083 (comment), you can remove the frame number check (or warning) at

```python
if num_frames < temporal_factor:
    raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
```

and instead use ceil at

```python
t_bar = round(num_frames / temporal_factor) * temporal_factor
```

to ensure t_bar is never zero.
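
For illustration (temporal_factor=2 assumed), round can send t_bar to zero for a single frame, while ceil cannot:

```python
import math

temporal_factor = 2
for num_frames in (1, 2, 3):
    t_round = round(num_frames / temporal_factor) * temporal_factor
    t_ceil = math.ceil(num_frames / temporal_factor) * temporal_factor
    print(num_frames, t_round, t_ceil)
# 1 -> t_round=0 (round(0.5) == 0 under Python's banker's rounding), t_ceil=2
# 2 -> t_round=2, t_ceil=2
# 3 -> t_round=2 (round(1.5) == 2), t_ceil=4
```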

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_vl, qwen3_vl

@yaogang2060 (Contributor, Author)

> @yaogang2060 As mentioned in #42083 (comment), you can remove the frame number check (or warning) at
>
> ```python
> if num_frames < temporal_factor:
>     raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
> ```
>
> and instead use ceil at
>
> ```python
> t_bar = round(num_frames / temporal_factor) * temporal_factor
> ```
>
> to ensure t_bar is never zero.


Just removed~

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) November 10, 2025 14:27
@zucchini-nlp zucchini-nlp merged commit 700c48a into huggingface:main Nov 10, 2025
15 checks passed