
Conversation

@MrAMS MrAMS commented Nov 1, 2025

Purpose: Add foundational video support to mtmd

This PR introduces the basic infrastructure required to support video processing in mtmd.

The primary motivation is to enable new models like minicpm-v 4.5, which require a processing technique that handles multiple video frames at once. In these models, the frames are no longer independent during the "eval chunk" step.

Implementation:

  • FFmpeg Integration: Leverages FFmpeg as a new optional dependency to decode video files and extract individual frames.
  • nz Dimension: Adds a flexible third dimension (nz) to the mtmd bitmap structure to support this new multi-frame data with minimal changes (see the sketch after this list).
  • Explicit Video Type: Introduces explicit video-related flags to allow the API to distinguish a "sequence of video frames" from a "batch of unrelated images."
  • Streaming Foundation: Adds a DecodedFramesQueue to manage frames, laying the groundwork for future streaming scenarios (e.g., a producer/consumer model).
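
To make the nz dimension concrete, here is a minimal sketch of what the extended bitmap could look like. The field names are taken from snippets quoted later in this thread (nx, ny, nz, type, data); this is an assumption for illustration, not the actual libmtmd layout:

    // Hypothetical sketch only, not the real libmtmd struct definition.
    #include <cstdint>
    #include <vector>

    enum class mtmd_bitmap_type { IMAGE, AUDIO, VIDEO };

    struct mtmd_bitmap {
        uint32_t nx = 0;                  // frame width
        uint32_t ny = 0;                  // frame height
        uint32_t nz = 1;                  // number of frames (1 for a single image)
        mtmd_bitmap_type type = mtmd_bitmap_type::IMAGE;
        std::vector<unsigned char> data;  // nx * ny * nz * 3 bytes, RGB, frame-major
    };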

This PR serves as the necessary foundation for our follow-up work, which will implement the full minicpm-v 4.5 multi-frame processing technique.

@MrAMS MrAMS changed the title from "Support video in mtmd" to "Support video in mtmd and server" Nov 1, 2025
@ngxson ngxson (Collaborator) left a comment

I expect the implementation to be much simpler: mtmd doesn't have to know anything about video; instead, each frame is supplied via a separate bitmap. There would be no changes to the mtmd API at all. You can add a helper to convert a video into bitmaps; it should be inside mtmd-helper to stay outside of the core library.

As a reminder, mtmd-helper is an optional unit, outside of the core libmtmd. It can only interface with libmtmd via the public API. For example, for audio processing, loading and resampling mp3/wav/flac files is done at mtmd-helper level, and the final PCM is fed into libmtmd. There is no resampling or decoding logic inside libmtmd.
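
As a rough illustration of that split, a hedged sketch of what such a helper could look like; the function name, signature, and per-frame steps below are assumptions for illustration, not an existing mtmd-helper API:

    // Hypothetical helper: decode a video with FFmpeg at the mtmd-helper level and
    // return one plain image bitmap per frame; libmtmd itself never sees the video.
    #include <string>
    #include <vector>

    struct mtmd_bitmap; // opaque bitmap type from the public mtmd.h header

    std::vector<mtmd_bitmap *> mtmd_helper_bitmaps_from_video(const std::string & path,
                                                              int max_frames) {
        std::vector<mtmd_bitmap *> frames;
        // 1. open `path` with FFmpeg and decode up to `max_frames` frames as RGB
        // 2. wrap each decoded frame in a regular image bitmap via the public API
        // 3. the caller feeds these bitmaps to libmtmd exactly like ordinary images
        return frames;
    }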

Also, unless you do temporal pooling (like what Qwen does), there is no need to modify clip.cpp

So overall, I unfortunately can't accept the current approach. However, I'm available to help if we want to simplify this.

Comment on lines 484 to 489
return lower.rfind(".jpg") != std::string::npos ||
lower.rfind(".jpeg") != std::string::npos ||
lower.rfind(".png") != std::string::npos ||
lower.rfind(".bmp") != std::string::npos ||
lower.rfind(".gif") != std::string::npos ||
lower.rfind(".webp") != std::string::npos;
Collaborator

I'm not convinced that we should check the file extension. This is quite prone to error.

Why can't we check magic bytes instead?
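
For reference, a minimal hedged sketch of such a magic-byte check (the helper name is made up for illustration; the magic numbers themselves are the standard signatures for each format):

    // Illustrative only: detect common image formats by their leading bytes
    // instead of trusting the file extension.
    #include <cstdio>
    #include <cstring>
    #include <string>

    static bool is_image_by_magic(const std::string & path) {
        unsigned char buf[12] = {0};
        FILE * f = std::fopen(path.c_str(), "rb");
        if (!f) return false;
        size_t n = std::fread(buf, 1, sizeof(buf), f);
        std::fclose(f);
        if (n < 12) return false;
        if (buf[0] == 0xFF && buf[1] == 0xD8 && buf[2] == 0xFF) return true;        // JPEG
        if (std::memcmp(buf, "\x89PNG\r\n\x1a\n", 8) == 0) return true;             // PNG
        if (std::memcmp(buf, "BM", 2) == 0) return true;                            // BMP
        if (std::memcmp(buf, "GIF8", 4) == 0) return true;                          // GIF87a/89a
        if (std::memcmp(buf, "RIFF", 4) == 0 && std::memcmp(buf + 8, "WEBP", 4) == 0) return true; // WebP
        return false;
    }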

Author

Thank you for the review! You are right. I will push the requested changes soon.

@MrAMS MrAMS (Author) Nov 3, 2025

This issue has already been fixed in commit ef68f2a8bbf60e.

MTMD_API uint32_t mtmd_bitmap_get_nx (const mtmd_bitmap * bitmap);
MTMD_API uint32_t mtmd_bitmap_get_ny (const mtmd_bitmap * bitmap);
MTMD_API const unsigned char * mtmd_bitmap_get_data (const mtmd_bitmap * bitmap);
MTMD_API unsigned char * mtmd_bitmap_get_data_mutable (mtmd_bitmap * bitmap);
Collaborator

I'm strongly against this API since it is very prone to memory issues.

The data is already set by mtmd_bitmap_init_from_video, so why do we need to modify it afterwards? This makes the data flow very hard to keep track of.

If you need to modify a bitmap object, it is better to simply create a brand new object (this is the base principle of functional programming).

Author

I implemented it this way to avoid the significant overhead from copying large video data.

My approach, as seen in convert_frames_to_bitmap, is to first allocate the uninitialized bitmap by passing nullptr to mtmd_bitmap_init_from_video. Then, I get a mutable pointer using mtmd_bitmap_get_data_mutable and populate the data in place.

I'm definitely open to feedback, so if you have a suggestion for a better or more idiomatic approach, I would greatly appreciate the guidance!

@ngxson ngxson (Collaborator) Nov 1, 2025

In reality, LLMs never process the video as a whole, so having one single mtmd_bitmap that stores the whole video is not a good approach.

Even the Qwen model only processes 2 frames at once, not the whole video. Audio encoders like Whisper only process 30s of audio at once (and in mtmd we do break the audio up into 30s segments; each mtmd_bitmap contains exactly 30s of audio). Video should follow the same logic.

For minicpm-v, it seems like each frame of the video is independent of the others, so the most intuitive approach is still one mtmd_bitmap per frame.

Collaborator

Also, using one mtmd_bitmap per frame would be better for a large video, as you can stream the processing instead of having to store the whole decoded video in memory.

Ofc we don't yet support this kind of streamed processing, but it can be something to add in the future.

Author

Thank you for sharing your perspective. I understand your concern that adding the mtmd_bitmap_get_data_mutable interface might introduce design overhead or deviate from the established coding conventions.

My personal view was that the performance gain from avoiding the extra memory copy justified this trade-off.

However, if you feel that the cost of the memory copy is more acceptable than the cost of compromising the design principle, I am happy to modify it as you requested.

Collaborator

Hmm, are you sure that this function will eliminate the copy?

This is the place where you use it:

    for (size_t i = 0; i < files.size(); i++) {
        if (i % stride != 0) continue;
        const std::string & f = files[i];
        if (!mtmd_helper::has_image_ext(f)) continue;
        mtmd::bitmap bmp(mtmd_helper_bitmap_init_from_file(ctx, f.c_str()));
        if (!bmp.ptr) continue;
        if(loaded==0){
            w = bmp.nx();
            h = bmp.ny();
            out_frames = mtmd_bitmap_init_from_video(w, h, nframes, nullptr);
            dest = mtmd_bitmap_get_data_mutable(out_frames);
        }
        GGML_ASSERT(bmp.nx() == w && bmp.ny() == h); // all frames must have the same size
        std::memcpy(dest,
                    bmp.data(),
                    bmp.n_bytes());
        dest += bmp.n_bytes();
        loaded++;
        if (loaded >= nframes) break;
    }

Notice the std::memcpy: you're copying the memory.

What I suggest is that, instead of copying the memory, you create multiple bitmaps with mtmd_bitmap_init_from_video, each pointing into the already-decoded data, like this:

size_t frame_offset = (size_t) nx * ny * 3 * i; // i is the index of the current frame in the video
mtmd_bitmap_init_from_video(..., frame_data + frame_offset);
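
Put together, a hedged sketch of the zero-copy per-frame variant (assuming the decoder has already produced one contiguous frame-major RGB buffer, and using this PR's proposed mtmd_bitmap_init_from_video per frame; none of this is final API):

    // Illustrative only: one bitmap per frame pointing into the decoded buffer,
    // instead of memcpy-ing everything into a single big video bitmap.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<mtmd_bitmap *> make_frame_bitmaps(uint32_t nx, uint32_t ny, size_t nframes,
                                                  const unsigned char * frame_data) {
        std::vector<mtmd_bitmap *> frames;
        frames.reserve(nframes);
        for (size_t i = 0; i < nframes; i++) {
            const size_t frame_offset = (size_t) nx * ny * 3 * i;  // RGB, frame-major
            frames.push_back(mtmd_bitmap_init_from_video(nx, ny, /*nframes =*/ 1,
                                                         frame_data + frame_offset));
        }
        return frames;
    }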

@ngxson ngxson (Collaborator) Nov 1, 2025

I'll think more about how the API should be extended to support video input, will let you know when I'm ready.

This technique requires handling multiple video frames at once

Currently for minicpm-v 4.5, how many frames are being processed together?

@MrAMS MrAMS (Author) Nov 1, 2025

Hmm, are you sure that this function will eliminate the copy?

You can check the code in convert_frames_to_bitmap; the copy overhead is avoided there.

The overhead is not avoided here because it reuses the mtmd_helper_bitmap_init_from_file function. And the overhead here is only a single frame's copy overhead, which is small compared to the video copy overhead within convert_frames_to_bitmap.

Author

Currently for minicpm-v 4.5, how many frames are being processed together?

I will follow up with more information this coming Monday.

Author

The overhead is not avoided here because it reuses the mtmd_helper_bitmap_init_from_file function. And the overhead here is only a single frame's copy overhead, which is small compared to the video copy overhead within convert_frames_to_bitmap.

Maybe I can change AV_PIX_FMT_RGBA to AV_PIX_FMT_RGB24 in decode_video_ffmpeg_to_rgba_from_format_ctx to avoid the RGBA->RGB conversion 🤔

res.overview_size = clip_image_size{slice_size, slice_size};
res.refined_size = clip_image_size{0, 0};
res.grid_size = clip_image_size{0, 0};
if (clip_is_minicpmv(ctx)) {
Collaborator

What's the change here compared to the initial version?

IMO this looks like duplicated logic; see clip_image_preprocess for how this is used by minicpm-v.

// legacy
bool has_llava_projector = false;
int minicpmv_version = 0;
int minicpmv_max_slice_nums = 9;
Collaborator

Search for max_slice_nums; it's already present in the code base.

Collaborator

If the sole purpose of this file is to convert video into images, it should be outside of the core mtmd library.

The core library only needs to care about what an image is. It doesn't care what a video is. If we get a video, we can use mtmd-helper (which is outside of the core) to convert the video into images.

@ngxson ngxson (Collaborator) Nov 2, 2025

This file should be renamed to mtmd-helper-video.cpp, or even better, move all the code to mtmd-helper.cpp.

Also, since these are helper functions, they should never use internal headers like clip.h or clip-impl.h. They can only use the public header mtmd.h.

Comment on lines 949 to 954
mtmd_bitmap * bitmap = new mtmd_bitmap;
bitmap->nx = nx;
bitmap->ny = ny;
bitmap->nz = nframes;
bitmap->type = mtmd_bitmap_type::VIDEO;
size_t data_size = (size_t)nx * ny * nframes * 3;
Collaborator

If a video is just multiple frames grouped together, why don't we simply create multiple mtmd_bitmaps, one for each frame?

MrAMS commented Nov 1, 2025

Hi @ngxson ,

Thank you so much for your detailed review and all the valuable suggestions. I truly appreciate the time you've taken.

I will follow up with a more detailed discussion and implement the necessary revisions based on your feedback next week.

In the meantime, I'd like to offer some brief clarification on a couple of points:

  • You are right to point out the new video dependency. This was introduced because minicpm-v 4.5 requires a new processing technique to speed up video understanding. It needs to handle multiple video frames at once, as the frames are no longer independent during the "eval chunk" step, which requires holistic information from the set of frames.
  • Regarding the points on mtmd-helper and clip.cpp, I agree they need more work. I will address these and continue the discussion in the next update.

Thank you again for your guidance!

ngxson commented Nov 1, 2025

  • You are right to point out the new video dependency. This was introduced because minicpm-v 4.5 requires a new processing technique to speed up video understanding. It needs to handle multiple video frames at once, as the frames are no longer independent during the "eval chunk" step, which requires holistic information from the set of frames.

In this case, I think it's better to first figure out how to re-enable real batching for certain models (probably starting with minicpm-v). I removed batching support because it was tricky to maintain, but we will eventually need to bring it back. Batching is needed if we want to pool multiple frames at once.

I think pooling multiple frames at once will be pretty much the norm for video processing in future models, so it's worth supporting. The video can be broken down into groups of frames; for example, if the video contains 10 frames and the model supports temporal pooling of 2, we end up with 10/2 = 5 chunks.
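
To make the chunking arithmetic concrete, a small hedged sketch (the frame type is left as a template parameter because the real frame representation is still undecided; only the grouping logic matters here):

    // Illustrative only: split frames into chunks of `pool` consecutive frames,
    // e.g. 10 frames with temporal pooling of 2 -> 5 chunks.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    template <typename Frame>
    std::vector<std::vector<Frame>> group_frames(const std::vector<Frame> & frames, size_t pool) {
        std::vector<std::vector<Frame>> chunks;
        for (size_t i = 0; i < frames.size(); i += pool) {
            const size_t end = std::min(frames.size(), i + pool);
            chunks.emplace_back(frames.begin() + i, frames.begin() + end);
        }
        return chunks;
    }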

Btw, could you point me to the Python reference code of minicpm-v 4.5, where multiple frames are merged together?

MrAMS commented Nov 1, 2025

In this case, I think it's better to first figure out how to re-enable real batching for certain models (probably starting with minicpm-v). I removed batching support because it was tricky to maintain, but we will eventually need to bring it back. Batching is needed if we want to pool multiple frames at once.

I agree that supporting multi-frame processing is the correct goal. Regarding the "batching" approach, I was slightly concerned that a generic batching mechanism might make the API's intent ambiguous—specifically, how to distinguish between a "batch of unrelated images" and a "sequence of video frames," which require different handling.

To keep the code clear and simple for this specific video use-case, I added the flexible third dimension nz and video type to the bitmap. My thinking was that this is a very minimal-cost modification that directly supports the video requirement.

Moreover, the meaning of nz itself can be further extended, which lays a foundation for future extensibility. This approach solves our immediate need for video frame processing clearly, without complicating the separate, larger discussion of re-enabling general-purpose batching.

ngxson commented Nov 1, 2025

I agree that supporting multi-frame processing is the correct goal. Regarding the "batching" approach, I was slightly concerned that a generic batching mechanism might make the API's intent ambiguous—specifically, how to distinguish between a "batch of unrelated images" and a "sequence of video frames," which require different handling.

Hmm then how do we "merge" the information from consecutive frames? I imagine you're processing multiple frames in batch, then after the final self-attention layer, you merge some embedding vectors together using something like AvgPool2D or Conv2D

Ofc we will need to add a param to indicate whether the images are a "sequence of video frames" or just normal, unrelated images. I don't think we can escape this.

To keep the code clear and simple for this specific video use-case, I added the flexible third dimension nz and video type to the bitmap. My thinking was that this is a very minimal-cost modification that directly supports the video requirement.

On second thought, I think yes, we can add this type of mtmd_bitmap into the code base. This should be pretty much the same use case as audio input, where one single mtmd_bitmap is "tokenized" into small chunks, 30s each. For video, one single mtmd_bitmap will be equivalent to multiple chunks. The number of frames in each chunk will correspond to the batch size that I mentioned above.

Finally, to avoid copying memory, we can initialize mtmd_bitmap using a callback function instead. This will allow the decoder program to stream the video frames, and the copy overhead will now be reduced significantly.
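
As one possible shape for that, a hedged sketch of a callback-based initializer (pure illustration; neither the function nor the callback type exists in libmtmd today):

    // Hypothetical API sketch: the bitmap pulls each frame from the decoder on
    // demand, so frames can be streamed into the bitmap's storage without first
    // materializing the whole decoded video.
    #include <cstdint>

    struct mtmd_bitmap; // opaque bitmap type

    // Fill `dst` with frame `frame_idx` (nx * ny * 3 RGB bytes); return false on error.
    typedef bool (*mtmd_frame_fill_fn)(uint32_t frame_idx, unsigned char * dst, void * user_data);

    mtmd_bitmap * mtmd_bitmap_init_from_video_cb(uint32_t nx, uint32_t ny, uint32_t nframes,
                                                 mtmd_frame_fill_fn fill, void * user_data);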

I'll try to draft a "placeholder" version of the API for this. Will let you know when it's ready.

@ehoogeveen-medweb

Possibly relevant: are the batches treated separately, or does the model use something closer to a sliding window (potentially requiring images from 2 consecutive batches)?

MrAMS commented Nov 2, 2025

Hmm then how do we "merge" the information from consecutive frames? I imagine you're processing multiple frames in batch, then after the final self-attention layer, you merge some embedding vectors together using something like AvgPool2D or Conv2D

Thank you for the ongoing discussion. Just to clarify the approach @tc-mb and I discussed, we're planning to implement this in two steps:

  • This PR (Step 1 - Foundation): The goal here is to integrate the most basic, foundational support for video into mtmd. We believe it's necessary to explicitly add video-related data structures or flags—rather than only reusing the existing image logic. This creates a clear and necessary distinction for video as a type.
  • The Next PR (Step 2 - Feature): Once this stable foundation is merged, we plan to submit a second feature PR for minicpm-v 4.5. This PR will build on the foundation from Step 1 to add support for the "new processing technique" (speeding up video understanding by handling multiple video frames at once).

My implementation idea to support this is:

  • "Ordinary" videos will continue to be processed as they are in this current PR (which may still leverage parts of the image logic).
  • However, models like minicpm-v 4.5 will introduce a new chunk type (e.g., video_chunk) to signal to the "eval chunk" step that this requires special, multi-frame processing.

On second thought, I think yes, we can add this type of mtmd_bitmap into the code base. This should be pretty much the same use case as audio input, where one single mtmd_bitmap is "tokenized" into small chunks, 30s each. For video, one single mtmd_bitmap will be equivalent to multiple chunks. The number of frames in each chunk will correspond to the batch size that I mentioned above.

One important clarification: "video" in this context does not necessarily mean a full, complete video file. It can also refer to a video segment or clip, where the length (number of frames) is a multiple of the number of frames the model processes at one time, just as you say.

Finally, to avoid copying memory, we can initialize mtmd_bitmap using a callback function instead. This will allow the decoder program to stream the video frames, and the copy overhead will now be reduced significantly.

My personal view is that for this specific problem, the simplicity, effectiveness, and performance of the solution should have a higher priority than strict adherence to a particular coding style or design pattern.

This is why I preferred the mtmd_bitmap_get_data_mutable approach. It is, in my opinion, more concise and easier to use, and the performance gain from avoiding the copy justifies this trade-off. However, I am not completely opposed to a callback. If you can propose a callback solution that is similarly easy to use, I am very willing to adopt it.


As a separate, long-term point:

I have also been planning how to best allow the decoder program to stream the video frames, like the discussions on how to design omni models.

My intended approach is to segment the video stream into small "video" clips (i.e., bitmaps with the video type) and feed these segments to mtmd one by one from another thread that runs the ffmpeg-related work. This streaming method would also neatly avoid the need to modify the bitmap logic at all, which might be the best long-term solution.
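
To illustrate the producer/consumer idea, a hedged sketch of such a bounded frame queue (the DecodedFramesQueue in the actual commit may look different; this only shows the concept):

    // Illustrative only: an FFmpeg decoding thread pushes decoded frames, the
    // inference thread pops them one at a time; the bound keeps memory in check.
    #include <condition_variable>
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <mutex>
    #include <vector>

    struct DecodedFrame {
        uint32_t nx = 0, ny = 0;
        std::vector<unsigned char> rgb;   // nx * ny * 3 bytes
    };

    class DecodedFramesQueue {
    public:
        explicit DecodedFramesQueue(size_t capacity) : capacity_(capacity) {}

        void push(DecodedFrame frame) {               // producer (decoder thread)
            std::unique_lock<std::mutex> lock(m_);
            not_full_.wait(lock, [&] { return q_.size() < capacity_; });
            q_.push_back(std::move(frame));
            not_empty_.notify_one();
        }

        DecodedFrame pop() {                          // consumer (inference thread)
            std::unique_lock<std::mutex> lock(m_);
            not_empty_.wait(lock, [&] { return !q_.empty(); });
            DecodedFrame frame = std::move(q_.front());
            q_.pop_front();
            not_full_.notify_one();
            return frame;
        }

    private:
        size_t capacity_;
        std::deque<DecodedFrame> q_;
        std::mutex m_;
        std::condition_variable not_empty_, not_full_;
    };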


// set MiniCPM-V UHD slicing upper bound (used when preprocessing images for MiniCPM-V)
// values < 1 will be clamped to 1
MTMD_API void mtmd_set_minicpmv_max_slice_nums(mtmd_context * ctx, int n);
@ngxson ngxson (Collaborator) Nov 2, 2025

Just want to note that we never accept creating model-specific public APIs like this one.

Also, the number of slices (as well as the slicing process) is completely managed by mtmd_tokenize; we should never leak this info to the public API.

@MrAMS MrAMS (Author) Nov 3, 2025

This issue has already been fixed in commit ef68f2a8bbf60e.

Comment on lines 26 to 30
enum class mtmd_bitmap_type {
IMAGE,
AUDIO,
VIDEO,
};
Collaborator

The convention for enums in llama.cpp is UPPERCASE_PREFIX_VALUE.

Example: MTMD_BITMAP_TYPE_IMAGE. We don't use the enum namespace (::).
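
Applied to the snippet above, the convention would look roughly like this:

    // Same values as above, renamed to follow the llama.cpp enum convention.
    enum mtmd_bitmap_type {
        MTMD_BITMAP_TYPE_IMAGE,
        MTMD_BITMAP_TYPE_AUDIO,
        MTMD_BITMAP_TYPE_VIDEO,
    };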

@MrAMS MrAMS (Author) Nov 3, 2025

This issue has already been fixed in commit ef68f2a8bbf60e.

Comment on lines 42 to 45
option(MTMD_MAX_VIDEO_FRAMES_SMALL "Set a small number of frames for fast test locally" OFF)
if(MTMD_MAX_VIDEO_FRAMES_SMALL)
target_compile_definitions(mtmd PRIVATE MTMD_MAX_VIDEO_FRAMES_SMALL)
endif()
Collaborator

I think this should be added as an env var instead. See MTMD_DEBUG_GRAPH as an example
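
For example, a hedged sketch of reading such an env var at runtime (the variable name MTMD_MAX_VIDEO_FRAMES is only an illustration, not an agreed-upon name):

    // Illustrative only: read an optional frame cap from the environment,
    // in the spirit of the existing MTMD_DEBUG_GRAPH-style runtime toggles.
    #include <cstdlib>

    static int mtmd_max_video_frames(int default_frames) {
        const char * env = std::getenv("MTMD_MAX_VIDEO_FRAMES");
        if (env == nullptr || *env == '\0') {
            return default_frames;
        }
        const int n = std::atoi(env);
        return n > 0 ? n : default_frames;
    }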

@MrAMS MrAMS (Author) Nov 3, 2025

This issue has already been fixed in commit ef68f2a8bbf60e.

Comment on lines 91 to 95
namespace mtmd_helper{
bool has_image_ext(const std::string & name);
bool is_dir(const std::string & path);
void list_files(const std::string & dir, std::vector<std::string> & out, bool recursive);
}
@ngxson ngxson (Collaborator) Nov 2, 2025

This section is supposed to contain C++ wrappers for the functions above.

The functions you're adding here do not wrap anything above; they should be removed.

@MrAMS MrAMS (Author) Nov 3, 2025

This issue has already been fixed in commit ef68f2a8bbf60e.

ngxson commented Nov 2, 2025

@MrAMS @tc-mb I addressed my comments in this PR: ngxson#32

I would suggest you rebase your implementation onto my PR above. Please let me know if I missed anything.

Please note that, in order to make the code review easier, I prefer adding functionality progressively, each piece in a dedicated PR.

For now, the current PR is about the base API (or the foundation, as @MrAMS suggested). This includes the most essential infra for video processing: the new bitmap type and helpers to decode a video file into bitmaps. Things like loading a list of image files, model-specific changes, etc. are out of scope. They should be added later, not now.

Also note that we should only aim to support llama-mtmd-cli first, then add support for llama-server later. This will allow other maintainers to keep track of the changes for each submodule.

@MrAMS MrAMS force-pushed the Support-video-in-mtmd branch from 30b0aaa to c5e9151 on November 3, 2025 09:08
@MrAMS MrAMS requested a review from ngxson November 3, 2025 09:12
@MrAMS MrAMS force-pushed the Support-video-in-mtmd branch from c5e9151 to ef68f2a on November 3, 2025 09:38
MrAMS commented Nov 3, 2025

Hi @ngxson,

Thank you so much for your incredibly detailed and constructive feedback, and especially for providing the demo.

We discussed it, and given the significant effort required to migrate and re-debug (as well as the internal development we've already based on this branch), we have opted to continue improving this PR based on your suggestions.

We want to thank you again for the example project. It was very insightful and directly inspired several of our changes, such as dropping mtmd_bitmap_get_data_mutable in favor of the new mtmd_bitmap_set_frame interface.


We have just pushed an updated version that comprehensively addresses your review. The main changes include:

  • Undid the changes in llama-server.
  • Added a check for magic bytes to validate image files.
  • Removed the mtmd_bitmap_get_data_mutable interface and adopted your suggestion of using mtmd_bitmap_set_frame.
  • Cleaned up the mtmd-helper interfaces and removed the clip-impl.h dependency from mtmd-video.cpp.
  • Updated enum naming to follow the UPPERCASE_PREFIX_VALUE convention.
  • Removed the MTMD_MAX_VIDEO_FRAMES_SMALL CMake Option, replacing it with an environment variable.

There are some points we'd like to clarify:

@tc-mb and I have re-confirmed, and the code in tools/mtmd/clip.cpp (specifically minicpmv_max_slice_nums and the if (clip_is_minicpmv(ctx)) check) is necessary for our use case. The reason is that the minicpmv model requires minicpmv_max_slice_nums as a global, model-wide parameter. This parameter is also required by the get_slice_instructions function to correctly process its image slicing logic.

Regarding the topic of streaming support, I've introduced DecodedFramesQueue in this latest commit. This frame queue is designed for streaming scenarios, where an ffmpeg-related thread acts as the producer (generating frames), and the main program (the LLM inference thread) acts as the consumer (processing them). This commit will lay the foundation for future streaming support.

Finally, thank you again for your thorough review and all the guidance. We would be very grateful if this PR could be merged into master once you are satisfied. This will provide the stable foundation we need for our second, follow-up PR, which will add support for the "new processing technique" (speeding up video understanding by handling multiple video frames at once).

Please let us know if you have any further feedback. We are happy to address any remaining points.

@MrAMS MrAMS marked this pull request as ready for review November 3, 2025 09:49
@MrAMS MrAMS changed the title from "Support video in mtmd and server" to "Support video in mtmd (base API)" Nov 3, 2025
ngxson commented Nov 3, 2025

@tc-mb and I have re-confirmed, and the code in tools/mtmd/clip.cpp (specifically minicpmv_max_slice_nums and the if (clip_is_minicpmv(ctx)) check) is necessary for our use case. The reason is that the minicpmv model requires minicpmv_max_slice_nums as a global, model-wide parameter. This parameter is also required by the get_slice_instructions function to correctly process its image slicing logic.

I don't mean clip_is_minicpmv is not required, but I think you should put it outside of get_slice_instructions. You can see the example in clip_image_preprocess where we supply parameters to get_slice_instructions. The main goal is to make get_slice_instructions a generic, model-independent function.
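
In other words, something in this spirit (a hedged sketch; the actual get_slice_instructions signature and the slice_instructions type in clip.cpp may differ):

    // Illustrative only: the model-specific decision (whether slicing applies and
    // the max slice count) is made by the caller, e.g. clip_image_preprocess, and
    // passed in as plain parameters, keeping get_slice_instructions model-independent.
    struct slice_instructions;
    struct clip_image_size { int width; int height; };

    slice_instructions get_slice_instructions(const clip_image_size & original_size,
                                              const clip_image_size & slice_size,
                                              int max_slice_nums);   // no clip_is_minicpmv(ctx) inside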


Since I still need the second PR to fully understand how the API and the code base should be structured, I think it's safer if you can just push everything to this PR. There is no need to create a 2nd PR.

Please note that I won't merge your PR, but in parallel I will make a cleaner version (like ngxson#32) which will eventually be merged into the main project. This way, you can freely implement what you want, I can make the code look good, and we can also save ourselves from having to go back and forth.

I hope you understand this, but as the core maintainer of the library, I have the responsibility to maintain a code base that works with all kinds of models: Qwen, MiniCPM, Llama, Gemma, etc. Therefore, I cannot risk adding too many model-specific functionalities - they should be generalized to work across multiple models. So I hope we can do what I suggested above to make things go smoother.

If this sounds good to you, I'll leave you to freely continue with this PR. You can ping me when you're done and need my review.

tc-mb (Contributor) commented Nov 3, 2025

Hi, I think there's been a bit of a misunderstanding; let me explain. I didn't expect that something like this could happen just over the weekend. I was busy with Omni's development and didn't see this in time to clarify it immediately. -_-

This branch is for our ongoing development; we still need to develop some features to ensure support for the high refresh rate capabilities of MiniCPM-V 4.5. My intern just wanted to see the differences from the main branch, but was a bit inexperienced, so he opened this pull request against the official llama.cpp repo.

Regarding the video interface, server support, and MiniCPM-V modifications: we added many features, and they shouldn't all be merged in one pull request. My suggestion is to wait until development is complete, confirm accuracy, and then split the added features up and merge them into llama.cpp as separate PRs.

@MrAMS As I said, we haven't finished development yet and shouldn't submit a pull request now. You can close this pull request.

@ngxson Multimodal models are starting to move towards video understanding, with more features to come. Although this PR was a bit of a misunderstanding, perhaps we can start considering the video interface.

We can extract the basic video capabilities and submit a simple PR. Or, as you mentioned above, you can refer to this branch and complete a PR yourself. We think both are acceptable, and we are open to hearing your opinion.

@MrAMS MrAMS closed this Nov 3, 2025
ngxson commented Nov 3, 2025

@tc-mb thanks for the clarification.

Re. your point about considering the video API: yes, I definitely want to add it, as many models now support video input. We just need to make sure the same infrastructure can be shared among models, as I explained earlier.

We can extract the basic video capabilities and submit a simple PR. Or, as you mentioned above, you can refer to this branch and complete a PR yourself. We think both are acceptable, and we are open to hearing your opinion.

Since I already have a good understanding of your direction, I think you can continue development on this branch. I'll mirror it as a PR on my fork at https://github.com/ngxson/llama.cpp ; you can always ping me when you're ready.

I think this will be the better solution for both of us, as it allows you to freely make changes while, at the same time, I continue to develop video infrastructure that will be compatible with many other models like Qwen (probably while doing so, I will also add video support for the Qwen model).

And finally, yes, I'm willing to help the OpenBMB/MiniCPM team bring this feature to llama.cpp. This has been a highly requested feature and I think it's now time to finally make it happen.

MrAMS commented Nov 12, 2025

Hi @ngxson

A significant portion of the implementation in this PR ngxson#32 is directly based on my code from ggml-org#16910, including most of the bitmap nz and ffmpeg-related code.

To ensure accurate attribution, could you please add the following co-author tags to the final merge commit message? (See: Creating a commit with multiple authors)

Co-authored-by: MrAMS <2421653893@qq.com>
Co-authored-by: tc-mb <caitianchi@modelbest.cn>

Thank you
