# MTMD interactive mode

`MtmdInteractiveModeExecute` shows how to pair a multimodal projection with a text model so the chat loop can reason over images supplied at runtime. The sample lives in `LLama.Examples/Examples/MtmdInteractiveModeExecute.cs` and reuses the interactive executor provided by LLamaSharp.

## Workflow
- Resolve the model, multimodal projection, and sample image paths via `UserSettings`.
- Create `ModelParams` for the text model and capture the MTMD defaults with `MtmdContextParams.Default()`.
- Load the base model and context, then initialize `SafeMtmdWeights` with the multimodal projection file.
- Resolve the media marker (`mtmdParameters.MediaMarker ?? NativeApi.MtmdDefaultMarker() ?? "<media>"`) and construct an `InteractiveExecutor` from the context and MTMD weights, as the snippet below shows.

```cs
// Paths (modelPath, multiModalProj) come from UserSettings in the sample.
var parameters = new ModelParams(modelPath);
var mtmdParameters = MtmdContextParams.Default();

using var model = await LLamaWeights.LoadFromFileAsync(parameters);
using var context = model.CreateContext(parameters);

// MTMD init: load the multimodal projection alongside the text model.
using var clipModel = await SafeMtmdWeights.LoadFromFileAsync(
    multiModalProj,
    model,
    mtmdParameters);

// Prefer the configured marker, then the native default, then a literal fallback.
var mediaMarker = mtmdParameters.MediaMarker
    ?? NativeApi.MtmdDefaultMarker()
    ?? "<media>";

var ex = new InteractiveExecutor(context, clipModel);
```

## Handling user input
- Prompts can include image paths wrapped in braces (for example `{c:/image.jpg}`); the loop finds those markers with a regular expression.
- Every referenced file is loaded through `SafeMtmdWeights.LoadMedia`, producing `SafeMtmdEmbed` instances that are queued for the next tokenization call.
- When the user provides images, the executor clears its KV cache (`MemorySequenceRemove`) before replacing each brace-wrapped path in the prompt with the multimodal marker.
- The embeds collected for the current turn are copied into `ex.Embeds`, so the executor submits both the text prompt and the pending media to the helper before generation. The sketch after this list pulls these steps together.

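A minimal sketch of that flow, assuming the variables from the setup snippet are in scope and that `ex.Embeds` is a mutable list as the sample implies; the regex pattern is illustrative rather than the sample's exact code:

```cs
using System.Text.RegularExpressions;

// Matches brace-wrapped paths such as {c:/image.jpg}.
var imagePathRegex = new Regex(@"\{([^}]+)\}");
var matches = imagePathRegex.Matches(prompt);

if (matches.Count > 0)
{
    // The sample clears the KV cache here (MemorySequenceRemove) before
    // mixing new media into the prompt; the exact call site is omitted.

    foreach (Match match in matches)
    {
        // Queue the image for the next tokenization call.
        var embed = clipModel.LoadMedia(match.Groups[1].Value);
        ex.Embeds.Add(embed);

        // Swap the brace-wrapped path for the multimodal marker.
        prompt = prompt.Replace(match.Value, mediaMarker);
    }
}
```
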
## Running the sample
1. Ensure the model and projection paths returned by `UserSettings` exist locally.
2. Start the example (for instance from the examples host application) and observe the initial description printed to the console.
3. Type text normally, or reference new images by including their path inside braces. Type `/exit` to end the conversation.

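The chat loop itself is standard interactive inference. A minimal sketch, assuming the executor from the setup snippet; the `InferenceParams` values are illustrative:

```cs
var inferenceParams = new InferenceParams
{
    MaxTokens = 256,                              // illustrative cap per reply
    AntiPrompts = new List<string> { "\nUser:" }  // illustrative stop string
};

while (true)
{
    Console.Write("\n> ");
    var prompt = Console.ReadLine();
    if (prompt is null || prompt == "/exit")
        break;

    // (Handle any brace-wrapped image references here, as sketched above.)

    await foreach (var token in ex.InferAsync(prompt, inferenceParams))
        Console.Write(token);
}
```
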
This walkthrough mirrors the logic in the sample so you can adapt it for your own multimodal workflows.