fix qwen2vl/qwen3vl video processor temporal padding when num_frames%temporal_patch_size!=1 #42083
Conversation
@zucchini-nlp please check this~
zucchini-nlp left a comment
Thanks a lot for the PR! Can you also update the Qwen2VL video processor with the same changes?
```python
T = stacked_videos.shape[1]
if pad := -T % temporal_patch_size:
    repeats = stacked_videos[:, -1:].expand(-1, pad, -1, -1, -1)
    stacked_videos = torch.cat((stacked_videos, repeats), dim=1)
B, T, C, H, W = stacked_videos.shape
num_frames, height, width = T, H, W
```
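The padding arithmetic above can be illustrated with a small torch-free sketch; `pad_frames` is a hypothetical helper that uses plain lists to show what `-T % temporal_patch_size` computes:

```python
def pad_frames(frames, temporal_patch_size):
    """Repeat the last frame until the count is divisible by temporal_patch_size.

    Mirrors the `pad := -T % temporal_patch_size` arithmetic from the diff:
    Python's negative modulo yields exactly the number of frames missing from
    the last temporal patch (0 when the count is already divisible).
    """
    pad = -len(frames) % temporal_patch_size
    return frames + [frames[-1]] * pad

# 5 frames with temporal_patch_size=2 -> one repeat of the last frame
print(pad_frames(["f0", "f1", "f2", "f3", "f4"], 2))
# -> ['f0', 'f1', 'f2', 'f3', 'f4', 'f4']
```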
I don't think this is needed if we are expanding it later, just before patchifying.
If resize is enabled and num_frames < temporal_patch_size, the resize check will throw an error at this line: https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen3_vl/video_processing_qwen3_vl.py#L44.
So I think it is needed here for cases such as a video that has only one frame.
Ah interesting. I remember we fixed a similar issue with the Qwen2-VL series in #38706.
IMO we should not be restricting the inputs to a certain size before resizing, so we can adjust the t_bar so it is never zero.
Agree. How about moving this logic to sample_frames instead of _preprocess? If users sample frames themselves, they’d be responsible for ensuring at least temporal_patch_size frames—though that’s rarely an issue since videos typically have more than two frames. By the way, when would a “video” have only one frame? Wouldn’t that just be an image? 🤔
> Agree. How about moving this logic to `sample_frames` instead of `_preprocess`? If users sample frames themselves, they'd be responsible for ensuring at least `temporal_patch_size` frames. By the way, when would a "video" have only one frame? Wouldn't that just be an image? 🤔
The case I am working with is: I have videos from 7 camera views, and each video can have a different number of frames; some views may have only one frame. I want to use the video processor to preprocess all of these videos, so I ran into this...
OK guys, I deleted the temporal padding for the num_frames < temporal_patch_size case... let's keep this PR simple. 🫡
```python
def test_image_input(self):
    for video_processing_class in self.video_processor_list:
        video_processor_dict = self.video_processor_dict.copy()
        video_processor_dict["size"] = {"longest_edge": 40960, "shortest_edge": 4096}
        video_processor_dict["do_sample_frames"] = False
        video_processor_dict["temporal_patch_size"] = 3
        video_processing = video_processing_class(**video_processor_dict)

        n, w, h = 1, 64, 64
        video_inputs = [np.random.randint(0, 256, (h, w, 3), dtype=np.uint8) for _ in range(n)]

        video_processed = video_processing(video_inputs, return_tensors="pt")
        encoded_videos = video_processed[self.input_name]
        self.assertEqual(list(encoded_videos.shape), [16, 2304])

        video_grid_thw = video_processed["video_grid_thw"]
        self.assertEqual(video_grid_thw.tolist(), [[1, 4, 4]])
```
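The expected shapes in this test can be derived by hand. The sketch below assumes a spatial patch_size of 16 (not shown in the snippet), so treat it as an illustration of the arithmetic rather than the processor's actual code path:

```python
# One 64x64 image with temporal_patch_size=3: the single frame is padded up
# to 3 frames, giving a temporal grid of 1.
channels = 3
temporal_patch_size = 3
patch_size = 16  # assumed spatial patch size
h = w = 64
num_frames = 1

padded_frames = num_frames + (-num_frames % temporal_patch_size)  # 3
grid_t = padded_frames // temporal_patch_size                     # 1
grid_h, grid_w = h // patch_size, w // patch_size                 # 4, 4

num_patches = grid_t * grid_h * grid_w                            # 16
feature_dim = channels * temporal_patch_size * patch_size**2      # 2304
print([num_patches, feature_dim], [grid_t, grid_h, grid_w])
# -> [16, 2304] [1, 4, 4]
```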
Kind of the same test as test_videos_PIL I guess, so it is redundant. I think the one below for temporal patch size is enough.
This test case covers a video with exactly one frame; test_videos_pil does not use exactly one frame. I updated the test function name.
I might be missing something: why do we need a test for a single frame? The issue solved by this PR is only about temporal padding.
Never mind, just saw another comment above.
zucchini-nlp left a comment
LGTM now! Do you plan to update smart_resize in this PR or make a separate one?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@JJJYmmm it's up to you 🫡. maybe just use
@yaogang2060 As mentioned in #42083 (comment), you can remove the frame number check (or warning) at src/transformers/models/qwen3_vl/video_processing_qwen3_vl.py, lines 43 to 44 (as of 59994f0), and instead use ceil in the t_bar computation to ensure t_bar is never zero.
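A minimal sketch of this suggestion, assuming t_bar is computed as a rounded multiple of temporal_patch_size (the actual smart_resize code is not shown here, so the helper names below are hypothetical); swapping round for ceil keeps t_bar nonzero for short clips:

```python
import math

def t_bar_round(num_frames, temporal_patch_size):
    # Rounding collapses to 0 for short clips,
    # e.g. 1 frame with temporal_patch_size=4.
    return round(num_frames / temporal_patch_size) * temporal_patch_size

def t_bar_ceil(num_frames, temporal_patch_size):
    # Suggested: ceil guarantees at least one full temporal patch.
    return math.ceil(num_frames / temporal_patch_size) * temporal_patch_size

print(t_bar_round(1, 4), t_bar_ceil(1, 4))  # 0 vs 4
print(t_bar_round(8, 4), t_bar_ceil(8, 4))  # 8 vs 8
```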
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen2_vl, qwen3_vl

just removed~
What does this PR do?
The Qwen3VL video processor is expected to handle these two cases:
Fixes QwenLM/Qwen3-VL#1689