
Conversation

@yaogang2060 (Contributor)

What does this PR do?

Allow the qwen3vl video processor to handle these two cases:

  1. num_frames is 1 and do_sample_frames is False.
  2. num_frames > temporal_patch_size and num_frames % temporal_patch_size != 1.

Fixes:
QwenLM/Qwen3-VL#1689
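
A minimal reproduction sketch of the two cases; the checkpoint name, the use of AutoVideoProcessor, and setting temporal_patch_size on the instance are illustrative assumptions, not taken from this PR:

```python
# Hypothetical repro sketch (checkpoint name is illustrative; AutoVideoProcessor
# is assumed to resolve to the Qwen3-VL video processor).
import numpy as np
from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

# Case 1: a "video" consisting of a single frame, with sampling disabled.
one_frame = np.random.randint(0, 256, (1, 64, 64, 3), dtype=np.uint8)
out = processor(videos=[one_frame], do_sample_frames=False)
print(out["video_grid_thw"])

# Case 2: a frame count whose remainder modulo temporal_patch_size is
# neither 0 nor 1 (needs temporal_patch_size > 2, e.g. 3 with 5 frames).
processor.temporal_patch_size = 3  # assumed override for illustration
five_frames = np.random.randint(0, 256, (5, 64, 64, 3), dtype=np.uint8)
out = processor(videos=[five_frames], do_sample_frames=False)
print(out["video_grid_thw"])
```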

@yaogang2060 (Contributor, Author)

@zucchini-nlp please check this~

@zucchini-nlp (Member) left a comment


Thanks a lot for the PR! Can you also update the Qwen2VL video processor with the same changes?

Comment on lines 199 to 204:

```python
T = stacked_videos.shape[1]
if pad := -T % temporal_patch_size:
    repeats = stacked_videos[:, -1:].expand(-1, pad, -1, -1, -1)
    stacked_videos = torch.cat((stacked_videos, repeats), dim=1)
B, T, C, H, W = stacked_videos.shape
num_frames, height, width = T, H, W
```
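
For reference, `-T % temporal_patch_size` is the smallest non-negative pad that makes T a multiple of temporal_patch_size (and 0, hence falsy, when T already is), which a few lines of plain Python confirm:

```python
# The padding arithmetic from the snippet above, in isolation:
# -T % p is the smallest pad >= 0 such that (T + pad) % p == 0.
p = 3  # temporal_patch_size
for T in range(1, 7):
    pad = -T % p
    assert (T + pad) % p == 0
    print(T, pad)  # 1->2, 2->1, 3->0, 4->2, 5->1, 6->0
```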
Member

I don't think this is needed if we are expanding it later, just before patchifying.

Contributor Author

If resize is enabled and num_frames < temporal_patch_size, the resize check throws an error on this line: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_vl/video_processing_qwen3_vl.py#L44.
So I think this is needed for cases such as a video that has only one frame.

Member

Ah, interesting. I remember we fixed a similar issue with the qwen2-vl series in #38706.

IMO we should not restrict the inputs to a certain size before resizing; instead we can adjust t_bar so it is never zero.

Contributor

Agree. How about moving this logic to sample_frames instead of _preprocess? If users sample frames themselves, they’d be responsible for ensuring at least temporal_patch_size frames—though that’s rarely an issue since videos typically have more than two frames. By the way, when would a “video” have only one frame? Wouldn’t that just be an image? 🤔

Contributor Author

> Agree. How about moving this logic to sample_frames instead of _preprocess? If users sample frames themselves, they’d be responsible for ensuring at least temporal_patch_size frames—though that’s rarely an issue since videos typically have more than two frames. By the way, when would a “video” have only one frame? Wouldn’t that just be an image? 🤔

My use case: I have videos from 7 camera views, and each video can have a different number of frames; some views may have only one frame. I want to use the video processor to preprocess these videos, which is how I ran into this...

Contributor Author

OK, I removed the temporal padding for the num_frames < temporal_patch_size case to keep this PR simple. 🫡

Comment on lines 332 to 348:

```python
def test_image_input(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
        video_processor_dict["do_sample_frames"] = False
        video_processor_dict["temporal_patch_size"] = 3
        video_processing = video_processing_class(**video_processor_dict)

        n, w, h = 1, 64, 64
        video_inputs = [np.random.randint(0, 256, (h, w, 3), dtype=np.uint8) for _ in range(n)]

        video_processed = video_processing(video_inputs, return_tensors="pt")
        encoded_videos = video_processed[self.input_name]
        self.assertEqual(list(encoded_videos.shape), [16, 2304])

        video_grid_thw = video_processed["video_grid_thw"]
        self.assertEqual(video_grid_thw.tolist(), [[1, 4, 4]])
```
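
For reference, the expected numbers follow from the patch geometry; patch_size=16 and 3 input channels are assumed defaults here, not stated in the PR:

```python
# Sanity check of the expected test shapes; patch_size=16 and RGB input
# are assumptions, not taken from the PR itself.
temporal_patch_size, patch_size, channels = 3, 16, 3
h = w = 64
grid_t = 1                                    # 1 frame, padded to 3, / temporal_patch_size
grid_h, grid_w = h // patch_size, w // patch_size           # 4, 4
num_patches = grid_t * grid_h * grid_w                      # 16
patch_dim = channels * temporal_patch_size * patch_size**2  # 3 * 3 * 256 = 2304
print([num_patches, patch_dim], [[grid_t, grid_h, grid_w]])  # [16, 2304] [[1, 4, 4]]
```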
Member

Kind of the same test as test_videos_PIL, I guess, so it is redundant. I think the one below for temporal patch size is enough.

Contributor Author

This test case covers a video with exactly one frame; test_videos_pil does not. I updated the test function name.

@zucchini-nlp (Member), Nov 10, 2025

I might be missing something: why do we need a test for a single frame? The issue solved by this PR is only about temporal padding.

Never mind, I just saw the comment above.

@yaogang2060 changed the title from "fix qwen3vl video processor temporal padding frames" to "fix qwen2vl/qwen3vl video processor temporal padding when num_frames%temporal_patch_size!=1" on Nov 10, 2025.
@zucchini-nlp (Member) left a comment

LGTM now! Do you plan to update smart_resize in this PR or make a separate one?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yaogang2060 (Contributor, Author)

> LGTM now! Do you plan to update smart_resize in this PR or make a separate one?

@JJJYmmm it's up to you 🫡. Maybe just use the smart_resize from qwen2vl?

@JJJYmmm (Contributor) commented Nov 10, 2025

@yaogang2060 As mentioned in #42083 (comment), you can remove the frame number check (or warning) at

```python
if num_frames < temporal_factor:
    raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
```

and instead use ceil at

```python
t_bar = round(num_frames / temporal_factor) * temporal_factor
```

to ensure t_bar is never zero.
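
For illustration (temporal_factor=2 assumed), round can send t_bar to zero for a single frame, while ceil cannot:

```python
import math

temporal_factor = 2
for num_frames in (1, 2, 3):
    t_round = round(num_frames / temporal_factor) * temporal_factor
    t_ceil = math.ceil(num_frames / temporal_factor) * temporal_factor
    print(num_frames, t_round, t_ceil)
# 1 -> t_round=0 (round(0.5) == 0 under Python's banker's rounding), t_ceil=2
# 2 -> t_round=2, t_ceil=2
# 3 -> t_round=2 (round(1.5) == 2), t_ceil=4
```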

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen2_vl, qwen3_vl

@yaogang2060 (Contributor, Author)

> @yaogang2060 As mentioned in #42083 (comment), you can remove the frame number check (or warning) at
>
> ```python
> if num_frames < temporal_factor:
>     raise ValueError(f"t:{num_frames} must be larger than temporal_factor:{temporal_factor}")
> ```
>
> and instead use ceil at
>
> ```python
> t_bar = round(num_frames / temporal_factor) * temporal_factor
> ```
>
> to ensure t_bar is never zero.


Just removed~

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) November 10, 2025 14:27
@zucchini-nlp zucchini-nlp merged commit 700c48a into huggingface:main Nov 10, 2025
15 checks passed