Support video in mtmd (base API) #16910
Conversation
I expect the implementation to be much simpler: mtmd doesn't have to know anything about video; instead, each frame is supplied as a separate bitmap. There will be no changes to the mtmd API at all. You can add a helper to convert a video into bitmaps; it should live in mtmd-helper so it stays outside of the core library.
As a reminder, mtmd-helper is an optional unit, outside of the core libmtmd. It can only interface with libmtmd via the public API. For example, for audio processing, loading and resampling mp3/wav/flac files is done at mtmd-helper level, and the final PCM is fed into libmtmd. There is no resampling or decoding logic inside libmtmd.
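For illustration, a video helper following that pattern might look like the sketch below; video_decoder and decode_next_frame_rgb are hypothetical (e.g. FFmpeg-backed), while mtmd_bitmap_init is the existing public constructor from mtmd.h:

    // Sketch of an mtmd-helper-level routine: decode a video into one
    // mtmd_bitmap per frame, using only the public mtmd.h API.
    #include "mtmd.h"
    #include <vector>

    std::vector<mtmd::bitmap> video_to_bitmaps(video_decoder & dec, uint32_t nx, uint32_t ny) {
        std::vector<mtmd::bitmap> frames;
        std::vector<unsigned char> rgb((size_t) nx * ny * 3); // one packed RGB frame
        while (decode_next_frame_rgb(dec, rgb.data())) {      // hypothetical decode step
            // mtmd_bitmap_init copies the buffer, so rgb can be reused for the next frame
            frames.emplace_back(mtmd_bitmap_init(nx, ny, rgb.data()));
        }
        return frames;
    }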
Also, unless you do temporal pooling (like what Qwen does), there is no need to modify clip.cpp
So overall, I unfortunately can't accept the current approach. However, I'm available to help if we want to simplify this.
tools/mtmd/mtmd-helper.cpp
Outdated
return lower.rfind(".jpg")  != std::string::npos ||
       lower.rfind(".jpeg") != std::string::npos ||
       lower.rfind(".png")  != std::string::npos ||
       lower.rfind(".bmp")  != std::string::npos ||
       lower.rfind(".gif")  != std::string::npos ||
       lower.rfind(".webp") != std::string::npos;
I'm not convinced that we should check the file extension. This is quite error-prone.
Can't we check magic bytes instead?
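For illustration, a minimal magic-byte check for the formats listed above might look like this sketch (has_image_magic is a hypothetical replacement for the extension check; the signatures shown are not exhaustive):

    // Sketch: identify image files by their magic bytes instead of extension.
    #include <cstdio>
    #include <cstring>

    static bool has_image_magic(const char * path) {
        unsigned char buf[12] = {0};
        FILE * f = fopen(path, "rb");
        if (!f) return false;
        size_t n = fread(buf, 1, sizeof(buf), f);
        fclose(f);
        if (n < 4) return false;
        const bool png  = memcmp(buf, "\x89PNG",      4) == 0;
        const bool jpg  = memcmp(buf, "\xFF\xD8\xFF", 3) == 0;
        const bool bmp  = memcmp(buf, "BM",           2) == 0;
        const bool gif  = memcmp(buf, "GIF8",         4) == 0;
        const bool webp = n >= 12 && memcmp(buf, "RIFF", 4) == 0
                                  && memcmp(buf + 8, "WEBP", 4) == 0;
        return png || jpg || bmp || gif || webp;
    }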
Thank you for the review! You are right. I will push the requested changes soon.
This issue has already been fixed in commit ef68f2a8bbf60e.
tools/mtmd/mtmd.h
Outdated
MTMD_API uint32_t              mtmd_bitmap_get_nx          (const mtmd_bitmap * bitmap);
MTMD_API uint32_t              mtmd_bitmap_get_ny          (const mtmd_bitmap * bitmap);
MTMD_API const unsigned char * mtmd_bitmap_get_data        (const mtmd_bitmap * bitmap);
MTMD_API unsigned char *       mtmd_bitmap_get_data_mutable(mtmd_bitmap * bitmap);
I'm strongly against this API since it is very prone to memory issues.
The data is already set by mtmd_bitmap_init_from_video; why do we need to modify it? This makes the data flow very hard to keep track of.
If you need to modify a bitmap object, it is better to simply create a brand-new object (a basic principle of functional programming).
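For example, a sketch of that style (transform_frame is a hypothetical processing step; mtmd_bitmap_init copies its input):

    // Functional style: build the modified pixels in a scratch buffer,
    // then construct a fresh bitmap; the original is never mutated.
    std::vector<unsigned char> pixels = transform_frame(src);     // hypothetical
    mtmd_bitmap * out = mtmd_bitmap_init(nx, ny, pixels.data());  // data is copied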
I implemented it this way to avoid the significant overhead from copying large video data.
My approach, as seen in convert_frames_to_bitmap, is to first allocate the uninitialized bitmap by passing nullptr to mtmd_bitmap_init_from_video. Then, I get a mutable pointer using mtmd_bitmap_get_data_mutable and populate the data in place.
I'm definitely open to feedback, so if you have a suggestion for a better or more idiomatic approach, I would greatly appreciate the guidance!
In reality, LLMs never process the video as a whole, so having one single mtmd_bitmap that stores the whole video is not a good approach.
Even the Qwen model only processes 2 frames at once, not the whole video. Audio encoders like Whisper only process 30s of audio at once (and in mtmd we do break up the audio into segments of 30s; each mtmd_bitmap contains exactly 30s of audio). So video should follow the same logic.
For minicpm-v, it seems like each video frame is independent of the others, so the most intuitive approach is still one mtmd_bitmap per frame.
Also, using one mtmd_bitmap per frame would be better for a large video, as you can stream the processing instead of having to store the whole (decoded bitmap) video in memory.
Ofc we don't yet support this kind of streamed processing, but it can be something to add in the future.
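A rough sketch of what such streamed processing could look like, assuming hypothetical decode_next_frame_rgb and eval_frame steps (mtmd_bitmap_init and mtmd_bitmap_free are the real public calls):

    // Stream frames one by one: decode -> wrap in a bitmap -> evaluate -> free.
    // At most one decoded frame lives in memory at a time.
    std::vector<unsigned char> rgb((size_t) nx * ny * 3);
    while (decode_next_frame_rgb(dec, rgb.data())) {               // hypothetical
        mtmd_bitmap * frame = mtmd_bitmap_init(nx, ny, rgb.data());
        eval_frame(ctx, frame);                                    // hypothetical tokenize/encode step
        mtmd_bitmap_free(frame);                                   // release before the next frame
    }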
Thank you for sharing your perspective. I understand your concern that adding the mtmd_bitmap_get_data_mutable interface might introduce design overhead or deviate from the established coding conventions.
My personal view was that the performance gain from avoiding the extra memory copy justified this trade-off.
However, if you feel that the cost of the memory copy is more acceptable than the cost of compromising the design principle, I am happy to modify it as you requested.
Hmm, are you sure that this function will eliminate the copy?
This is the place where you use it:
for (size_t i = 0; i < files.size(); i++) {
    if (i % stride != 0) continue;
    const std::string & f = files[i];
    if (!mtmd_helper::has_image_ext(f)) continue;
    mtmd::bitmap bmp(mtmd_helper_bitmap_init_from_file(ctx, f.c_str()));
    if (!bmp.ptr) continue;
    if (loaded == 0) {
        w = bmp.nx();
        h = bmp.ny();
        out_frames = mtmd_bitmap_init_from_video(w, h, nframes, nullptr);
        dest = mtmd_bitmap_get_data_mutable(out_frames);
    }
    GGML_ASSERT(bmp.nx() == w && bmp.ny() == h); // all frames must have the same size
    std::memcpy(dest, bmp.data(), bmp.n_bytes());
    dest += bmp.n_bytes();
    loaded++;
    if (loaded >= nframes) break;
}

Notice that std::memcpy: you're copying the memory.
What I suggest is, instead of copying the memory, to call mtmd_bitmap_init_from_video multiple times, one per frame, like this:

    size_t frame_offset = (size_t) nx * ny * 3 * i; // i is the index of the current frame in the video
    mtmd_bitmap_init_from_video(..., frame_data + frame_offset);
I'll think more about how the API should be extended to support video input, will let you know when I'm ready.
This technique requires handling multiple video frames at once
Currently for minicpm-v 4.5, how many frames are being processed together?
Hmm, are you sure that this function will eliminate the copy?
You can check the code in convert_frames_to_bitmap; the copy overhead is avoided there.
The overhead is not avoided here because it reuses the mtmd_helper_bitmap_init_from_file function. And the overhead here is only a single frame's copy overhead, which is small compared to the video copy overhead within convert_frames_to_bitmap.
Currently for minicpm-v 4.5, how many frames are being processed together?
I will follow up with more information this coming Monday.
The overhead is not avoided here because it reuses the mtmd_helper_bitmap_init_from_file function. And the overhead here is only a single frame's copy overhead, which is small compared to the video copy overhead within convert_frames_to_bitmap.
Maybe I can change AV_PIX_FMT_RGBA to AV_PIX_FMT_RGB24 in decode_video_ffmpeg_to_rgba_from_format_ctx to avoid the RGBA->RGB conversion 🤔
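For reference, a minimal libswscale sketch of that idea, assuming frame is an already-decoded AVFrame and rgb_buf is an nx*ny*3 output buffer:

    // Ask libswscale for packed RGB24 directly, skipping any RGBA->RGB pass.
    #include <libswscale/swscale.h>

    struct SwsContext * sws = sws_getContext(
        frame->width, frame->height, (enum AVPixelFormat) frame->format,
        frame->width, frame->height, AV_PIX_FMT_RGB24,
        SWS_BILINEAR, NULL, NULL, NULL);
    uint8_t * dst[4]    = { rgb_buf };           // tightly packed RGB24 output
    int dst_linesize[4] = { frame->width * 3 };
    sws_scale(sws, frame->data, frame->linesize, 0, frame->height, dst, dst_linesize);
    sws_freeContext(sws);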
res.overview_size = clip_image_size{slice_size, slice_size};
res.refined_size  = clip_image_size{0, 0};
res.grid_size     = clip_image_size{0, 0};
if (clip_is_minicpmv(ctx)) {
What's the change here compared to the initial version?
IMO this looks like duplicated logic; see clip_image_preprocess for how this is used by minicpm-v.
// legacy
bool has_llava_projector = false;
int minicpmv_version = 0;
int minicpmv_max_slice_nums = 9;
Search for max_slice_nums, it's already present in the code base.
If the sole purpose of this file is to convert video into images, it should be outside of the core mtmd library.
The core library only needs to care about what an image is; it doesn't care what a video is. If we get a video, we can use mtmd-helper (which is outside of the core) to convert the video into images.
This file should be renamed to mtmd-helper-video.cpp, or even better, all the code should be moved to mtmd-helper.cpp.
Also, since these are helper functions, they should never use internal headers like clip.h or clip-impl.h; they can only use the public header mtmd.h.
mtmd_bitmap * bitmap = new mtmd_bitmap;
bitmap->nx   = nx;
bitmap->ny   = ny;
bitmap->nz   = nframes;
bitmap->type = mtmd_bitmap_type::VIDEO;
size_t data_size = (size_t)nx * ny * nframes * 3;
If a video is just multiple frames grouped together, why don't we simply create multiple mtmd_bitmaps, one for each frame?
Hi @ngxson, thank you so much for your detailed review and all the valuable suggestions. I truly appreciate the time you've taken. I will follow up with a more detailed discussion and implement the necessary revisions based on your feedback next week. In the meantime, I'd like to offer some brief clarification on a couple of points:

Thank you again for your guidance!
In this case, I think it's better to first figure out how to re-enable real batching for certain models (probably starting with minicpm-v). I removed batching support because it was tricky to maintain, but we will eventually need to bring it back. Batching is needed if we want to pool multiple frames at once. I think pooling multiple frames at once will be pretty much the norm for video processing in future models, so it's worth supporting. The video can be broken down into groups of frames; for example, if the video contains 10 frames and the model supports temporal pooling of 2, we will end up having 10/2 = 5 chunks. Btw, could you point me to the python reference code of minicpm-v 4.5, where multiple frames are merged together?
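A tiny sketch of the chunking described above (encode_chunk is a hypothetical per-chunk encode step):

    // Break the video into groups of n_pool consecutive frames:
    // e.g. 10 frames with temporal pooling of 2 -> 10/2 = 5 chunks.
    const int n_pool = 2;
    for (int i = 0; i + n_pool <= n_frames; i += n_pool) {
        encode_chunk(&frames[i], n_pool); // frames[i .. i+n_pool-1] form one chunk
    }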
I agree that supporting multi-frame processing is the correct goal. Regarding the "batching" approach, I was slightly concerned that a generic batching mechanism might make the API's intent ambiguous; specifically, how to distinguish between a "batch of unrelated images" and a "sequence of video frames", which require different handling. To keep the code clear and simple for this specific video use-case, I added the flexible third dimension (nz). Moreover, the meaning of
Hmm, then how do we "merge" the information from consecutive frames? I imagine you're processing multiple frames in a batch, then after the final self-attention layer, you merge some embedding vectors together using something like AvgPool2D or Conv2D. Of course, we will need to add a param to indicate whether the images are "sequences of video frames" or just normal unrelated images. I don't think we can escape this.
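For concreteness, a sketch of temporal average pooling with factor 2 over per-frame embeddings (assuming a row-major emb buffer of shape [n_frames][n_dim]):

    // Merge each pair of consecutive frame embeddings by averaging,
    // halving the number of embedding vectors.
    std::vector<float> pooled((size_t)(n_frames / 2) * n_dim);
    for (int p = 0; p < n_frames / 2; p++) {
        for (int d = 0; d < n_dim; d++) {
            pooled[(size_t) p * n_dim + d] =
                0.5f * (emb[(size_t)(2*p) * n_dim + d] + emb[(size_t)(2*p + 1) * n_dim + d]);
        }
    }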
On second thought, I think yes, we can add this type of

Finally, to avoid copying memory, we can initialize

I'll try to draft a "placeholder" version of the API for this. Will let you know when it's ready.
Possibly relevant: are the batches treated separately, or does the model use something closer to a sliding window (potentially requiring images from two consecutive batches)?
Thank you for the ongoing discussion. Just to clarify the approach @tc-mb and I discussed, we're planning to implement this in two steps:

My implementation idea to support this is:

One important clarification: "video" in this context does not necessarily mean a full, complete video file. It can also refer to a video segment or clip, where the length (number of frames) is a multiple of the number of frames the model processes at one time, just as you said.

My personal view is that for this specific problem, the simplicity, effectiveness, and performance of the solution should have a higher priority than strict adherence to a particular coding style or design pattern. This is why I preferred the

As a separate, long-term point: I have also been planning how to best allow the decoder program to stream the video frames, like the discussions on how to design omni models. My intended approach is to segment the video stream into small "video" clips (i.e.,
tools/mtmd/mtmd.h
Outdated
// set MiniCPM-V UHD slicing upper bound (used when preprocessing images for MiniCPM-V)
// values < 1 will be clamped to 1
MTMD_API void mtmd_set_minicpmv_max_slice_nums(mtmd_context * ctx, int n);
Just want to note that we never accept model-specific public APIs like this one.
Also, the number of slices (as well as the slicing process) is completely managed by mtmd_tokenize; we should never leak this info to the public API.
This issue has already been fixed in commit ef68f2a8bbf60e.
tools/mtmd/mtmd.cpp
Outdated
enum class mtmd_bitmap_type {
    IMAGE,
    AUDIO,
    VIDEO,
};
The convention for enums in llama.cpp is UPPERCASE_PREFIX_VALUE.
Example: MTMD_BITMAP_TYPE_IMAGE; we don't use enum class namespacing (::).
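Applied to the enum above, that convention would give something like:

    enum mtmd_bitmap_type {
        MTMD_BITMAP_TYPE_IMAGE,
        MTMD_BITMAP_TYPE_AUDIO,
        MTMD_BITMAP_TYPE_VIDEO,
    };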
This issue has already been fixed in commit ef68f2a8bbf60e.
tools/mtmd/CMakeLists.txt
Outdated
option(MTMD_MAX_VIDEO_FRAMES_SMALL "Set a small number of frames for fast test locally" OFF)
if(MTMD_MAX_VIDEO_FRAMES_SMALL)
    target_compile_definitions(mtmd PRIVATE MTMD_MAX_VIDEO_FRAMES_SMALL)
endif()
I think this should be added as an env var instead. See MTMD_DEBUG_GRAPH as an example.
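For illustration, reading such a limit at runtime could look like this sketch (MTMD_MAX_VIDEO_FRAMES is a hypothetical variable name, and 32 an arbitrary default):

    // Read an optional frame limit from the environment instead of a build flag.
    #include <cstdlib>

    const char * s = std::getenv("MTMD_MAX_VIDEO_FRAMES"); // hypothetical env var
    const int max_video_frames = s ? std::atoi(s) : 32;    // arbitrary default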
This issue has already been fixed in commit ef68f2a8bbf60e.
tools/mtmd/mtmd-helper.h
Outdated
namespace mtmd_helper {
    bool has_image_ext(const std::string & name);
    bool is_dir(const std::string & path);
    void list_files(const std::string & dir, std::vector<std::string> & out, bool recursive);
}
This section is supposed to contain C++ wrappers for the functions above.
The functions you're adding here do not wrap anything above; they should be removed.
This issue has already been fixed in commit ef68f2a8bbf60e.
@MrAMS @tc-mb I addressed my comments in this PR: ngxson#32

I would suggest you rebase your implementation onto my PR above. Please let me know if I missed anything.

Please note that, in order to make the code review easier, I prefer adding functions progressively, each in a dedicated PR. For now, the current PR is about the base API (or the foundation, as @MrAMS suggested). This includes the most essential infra for video processing: the new bitmap type and helpers to decode a video file into bitmaps. Things like loading a list of image files, model-specific changes, etc. are out of scope. They should be added later, not now.

Also note that we should only aim for supporting
Force-pushed from 30b0aaa to c5e9151
Force-pushed from c5e9151 to ef68f2a
Hi @ngxson, thank you so much for your incredibly detailed and constructive feedback, and especially for providing the demo. We discussed it, and given the significant effort required to migrate and re-debug (as well as the internal development we've already based on this branch), we have opted to continue improving this PR based on your suggestions. We want to thank you again for the example project. It was very insightful and directly inspired several of our changes, such as dropping

We have just pushed an updated version that comprehensively addresses your review. The main changes include:

There are some points we'd like to clarify: @tc-mb and I have re-confirmed, and the code in

Regarding the topic of streaming support, I've introduced DecodedFramesQueue.

Finally, thank you again for your thorough review and all the guidance. We would be very grateful if this PR could be merged into master once you are satisfied. This will provide the stable foundation we need for our second, follow-up PR, which will add support for the "new processing technique" (speeding up video understanding by handling multiple video frames at once). Please let us know if you have any further feedback. We are happy to address any remaining points.
I don't mean

Since I still need the second PR to fully understand how the API and the code base should be structured, I think it's safer if you just push everything to this PR. There is no need to create a 2nd PR.

Please note that I won't merge your PR; in parallel, I will make a cleaner version (like ngxson#32) which will eventually be merged into the main project. This way, you can freely implement what you want, I can make the code look good, and we can also save time by not having to go back and forth between us.

I hope you understand this, but as the core maintainer of the library, I have the responsibility to maintain a code base that works with all kinds of models: Qwen, MiniCPM, Llama, Gemma, etc. Therefore, I cannot risk adding too many model-specific functionalities; they should be generalized to work across multiple models. So I hope we can do what I suggested above to make things go smoother.

If this sounds good to you, I'll leave you to continue freely with this PR. You can ping me when you're done and need my review.
Hi, I think there's been a bit of a misunderstanding; let me explain. I didn't expect that, just over the weekend, something like this could have happened. I was busy with Omni's development and didn't see this issue and clarify it immediately. -_-

This branch is for our ongoing development; we still need to develop some features to ensure support for the high refresh rate capabilities of MiniCPM-V 4.5. My intern just wanted to see the differences from the main branch, but was a bit inexperienced, so he submitted it to the official llama.cpp repo.

Regarding the video interface, server support, and MiniCPM-V modifications: we added many features, and they shouldn't all be merged in one pull request. My suggestion is to wait until development is complete, confirm accuracy, and then, based on the added features, split them up and merge them into llama.cpp separately.

@MrAMS As I said, we haven't finished development yet and shouldn't submit a pull request now. You can close this pull request.

@ngxson Multimodal models are starting to move towards video understanding, with more features to come later. Although this PR was a bit of a misunderstanding, perhaps we can start considering the video interface. We can extract the basic video capabilities and submit a simple PR. Or, as you mentioned above, you can refer to this branch and complete a PR yourself. We think both are acceptable, and we are open to hearing your opinion.
@tc-mb thanks for the clarification. Re. your point about considering the video API: yes, I definitely want to add it, as many models now support video input. We just need to make sure the same infrastructure can be shared among models, as I explained earlier.

Since I already have enough understanding of your direction, I think you can continue development on this branch. I'll mirror it as a PR on my fork at https://github.com/ngxson/llama.cpp ; you can always ping me when you're ready. I think this will be the better solution for both of us, as it allows you to freely make changes while, at the same time, I continue to develop the video infrastructure that will be compatible with many other models like Qwen (probably, while doing so, I will also add video support for the Qwen model).

And finally, yes, I'm willing to help the OpenBMB/MiniCPM team bring this feature to llama.cpp. This has been a highly requested feature, and I think it's now time to finally make it happen.
Hi @ngxson, a significant portion of the implementation in ngxson#32 is directly based on my code from ggml-org#16910, including most of the bitmap

To ensure accurate attribution, could you please add the following co-author tags to the final merge commit message? (See: Creating a commit with multiple authors)

Thank you
Purpose: Add foundational video support to mtmd

This PR introduces the basic infrastructure required to support video processing in mtmd. The primary motivation is to enable new models like minicpm-v 4.5, which require a processing technique that handles multiple video frames at once. In these models, the frames are no longer independent during the "eval chunk" step.

Implementation:
- nz dimension: adds a flexible third dimension (nz) to the mtmd bitmap structure to support this new multi-frame data with minimal changes.
- Introduces a DecodedFramesQueue to manage frames, laying the groundwork for future streaming scenarios (e.g., a producer/consumer model).

This PR serves as the necessary foundation for our follow-up work, which will implement the full minicpm-v 4.5 multi-frame processing technique.
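As a generic illustration of the producer/consumer idea behind such a queue (the actual DecodedFramesQueue in this branch may differ), a bounded frame queue might look like:

    // Sketch: a decoder thread pushes decoded RGB frames; a consumer thread
    // pops them for tokenization. The bound (cap) keeps memory usage limited.
    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct DecodedFramesQueue {
        std::deque<std::vector<unsigned char>> q; // one decoded RGB frame per entry
        std::mutex m;
        std::condition_variable cv;
        size_t cap = 8; // hold at most a few frames at a time

        void push(std::vector<unsigned char> frame) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return q.size() < cap; }); // block while full
            q.push_back(std::move(frame));
            cv.notify_all();
        }

        std::vector<unsigned char> pop() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !q.empty(); });     // block while empty
            std::vector<unsigned char> f = std::move(q.front());
            q.pop_front();
            cv.notify_all();
            return f;
        }
    };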