Commit 113cfc2

undo the changes and submit a separate pull request to the llama-server later
1 parent 1e9c563 commit 113cfc2

3 files changed: +9, -55 lines changed


tools/server/README.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -430,7 +430,7 @@ Multiple prompts are also supported. In this case, the completion result will be
 - Strings, JSON objects, and sequences of tokens: `["string1", [12, 34, 56], { "prompt_string": "string", "multimodal_data": ["base64"]}]`
 - Mixed types: `[[12, 34, "string", 56, 78], [12, 34, 56], "string", { "prompt_string": "string" }]`
 
-Note for `multimodal_data` in JSON object prompts. This should be an array of strings, containing base64 encoded multimodal data such as images, audio and video. There must be an identical number of MTMD media markers in the string prompt element which act as placeholders for the data provided to this parameter. The multimodal data files will be substituted in order. The marker string (e.g. `<__media__>`) can be found by calling `mtmd_default_marker()` defined in [the MTMD C API](https://github.com/ggml-org/llama.cpp/blob/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0/tools/mtmd/mtmd.h#L87). A client *must not* specify this field unless the server has the multimodal capability. Clients should check `/models` or `/v1/models` for the `multimodal` capability before a multimodal request.
+Note for `multimodal_data` in JSON object prompts. This should be an array of strings, containing base64 encoded multimodal data such as images and audio. There must be an identical number of MTMD media markers in the string prompt element which act as placeholders for the data provided to this parameter. The multimodal data files will be substituted in order. The marker string (e.g. `<__media__>`) can be found by calling `mtmd_default_marker()` defined in [the MTMD C API](https://github.com/ggml-org/llama.cpp/blob/5fd160bbd9d70b94b5b11b0001fd7f477005e4a0/tools/mtmd/mtmd.h#L87). A client *must not* specify this field unless the server has the multimodal capability. Clients should check `/models` or `/v1/models` for the `multimodal` capability before a multimodal request.
 
 `temperature`: Adjust the randomness of the generated text. Default: `0.8`
```
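For reference, a minimal sketch of the JSON-object prompt form documented above, assuming a multimodal-capable llama-server listening at `http://localhost:8080`; the host, the `cat.jpg` file name, and the `n_predict` value are placeholders, not part of this commit:

```python
import base64

import requests  # any HTTP client works; requests is used for brevity

# Assumption: "cat.jpg" is a placeholder image file on the client machine.
with open("cat.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "prompt": {
        # exactly one <__media__> marker per multimodal_data entry,
        # substituted in order
        "prompt_string": "Describe this image: <__media__>",
        "multimodal_data": [img_b64],
    },
    "n_predict": 128,
}

resp = requests.post("http://localhost:8080/completion", json=payload)
resp.raise_for_status()
print(resp.json()["content"])
```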

```diff
@@ -1211,7 +1211,7 @@ print(completion.choices[0].text)
 
 Given a ChatML-formatted json description in `messages`, it returns the predicted completion. Both synchronous and streaming mode are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with OpenAI API spec is being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggml-org/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.
 
-If model supports multimodal, you can input the media file via `image_url` or `video_url` content part. We support both base64 and remote URL as input. See OAI documentation for more.
+If model supports multimodal, you can input the media file via `image_url` content part. We support both base64 and remote URL as input. See OAI documentation for more.
 
 *Options:*
 
```
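Similarly, a hedged sketch of the `image_url` path retained by this commit, on the OpenAI-compatible chat endpoint via the official `openai` Python client; the base URL, API key, model name, and image file are placeholders:

```python
import base64

from openai import OpenAI  # official openai client, v1 API

# Assumptions: llama-server at localhost:8080 with an mmproj loaded; the API
# key is a dummy value and the model name is a placeholder (llama-server
# serves the model it was started with).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

with open("cat.jpg", "rb") as f:  # placeholder image file
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("ascii")

completion = client.chat.completions.create(
    model="default",  # placeholder; ignored by llama-server unless configured
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this picture?"},
            # a remote http(s) URL is also accepted in place of the data URI
            {"type": "image_url", "image_url": {"url": data_uri}},
        ],
    }],
)
print(completion.choices[0].message.content)
```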

tools/server/server.cpp

Lines changed: 5 additions & 5 deletions
```diff
@@ -3926,20 +3926,20 @@ struct server_context {
 
                 SLT_INF(slot, "n_tokens = %d, memory_seq_rm [%d, end)\n", slot.prompt.n_tokens(), p0);
 
-                // check if we should process the media chunk (image, audio, video, ...)
+                // check if we should process the image
                 if (slot.prompt.n_tokens() < slot.task->n_tokens() && input_tokens[slot.prompt.n_tokens()] == LLAMA_TOKEN_NULL) {
-                    // process the media
+                    // process the image
                     size_t n_tokens_out = 0;
                     int32_t res = input_tokens.process_chunk(ctx, mctx, slot.prompt.n_tokens(), slot.prompt.tokens.pos_next(), slot.id, n_tokens_out);
                     if (res != 0) {
-                        SLT_ERR(slot, "failed to process media, res = %d\n", res);
-                        send_error(slot, "failed to process media", ERROR_TYPE_SERVER);
+                        SLT_ERR(slot, "failed to process image, res = %d\n", res);
+                        send_error(slot, "failed to process image", ERROR_TYPE_SERVER);
                         slot.release();
                         continue;
                     }
 
                     slot.n_prompt_tokens_processed += n_tokens_out;
-                    // add the media chunk to cache
+                    // add the image chunk to cache
                     {
                         const auto & chunk = input_tokens.find_chunk(slot.prompt.n_tokens());
                         slot.prompt.tokens.push_back(chunk.get()); // copy
```

tools/server/utils.hpp

Lines changed: 2 additions & 48 deletions
```diff
@@ -679,53 +679,7 @@ static json oaicompat_chat_params_parse(
                 p["text"] = mtmd_default_marker();
                 p.erase("input_audio");
 
-            } else if (type == "video_url") {
-                if (!opt.allow_image) { // TODO: separate video flag?
-                    throw std::runtime_error("video input is not supported - hint: if this is unexpected, you may need to provide the mmproj");
-                }
-
-                json video_url = json_value(p, "video_url", json::object());
-                std::string url = json_value(video_url, "url", std::string());
-                if (string_starts_with(url, "http")) {
-                    // download remote image
-                    // TODO @ngxson : maybe make these params configurable
-                    common_remote_params params;
-                    params.headers.push_back("User-Agent: llama.cpp/" + build_info);
-                    params.max_size = 1024 * 1024 * 100; // 100MB
-                    params.timeout = 100; // seconds
-                    SRV_INF("downloading video from '%s'\n", url.c_str());
-                    auto res = common_remote_get_content(url, params);
-                    if (200 <= res.first && res.first < 300) {
-                        SRV_INF("downloaded %ld bytes\n", res.second.size());
-                        raw_buffer data;
-                        data.insert(data.end(), res.second.begin(), res.second.end());
-                        out_files.push_back(data);
-                    } else {
-                        throw std::runtime_error("Failed to download video");
-                    }
-
-                } else {
-                    // try to decode base64 video
-                    std::vector<std::string> parts = string_split<std::string>(url, /*separator*/ ',');
-                    if (parts.size() != 2) {
-                        throw std::runtime_error("Invalid video_url.url value");
-                    } else if (!string_starts_with(parts[0], "data:video/")) {
-                        throw std::runtime_error("Invalid video_url.url format: " + parts[0]);
-                    } else if (!string_ends_with(parts[0], "base64")) {
-                        throw std::runtime_error("video_url.url must be base64 encoded");
-                    } else {
-                        auto base64_data = parts[1];
-                        auto decoded_data = base64_decode(base64_data);
-                        out_files.push_back(decoded_data);
-                    }
-                }
-
-                // replace this chunk with a marker
-                p["type"] = "text";
-                p["text"] = mtmd_default_marker();
-                p.erase("video_url");
-
-            }else if (type != "text") {
+            } else if (type != "text") {
                 throw std::runtime_error("unsupported content[].type");
             }
         }
@@ -1460,7 +1414,7 @@ static server_tokens process_mtmd_prompt(mtmd_context * mctx, std::string prompt
     for (auto & file : files) {
         mtmd::bitmap bmp(mtmd_helper_bitmap_init_from_buf(mctx, file.data(), file.size()));
         if (!bmp.ptr) {
-            throw std::runtime_error("Failed to load media file");
+            throw std::runtime_error("Failed to load image or audio file");
         }
         // calculate bitmap hash (for KV caching)
         std::string hash = fnv_hash(bmp.data(), bmp.n_bytes());
```
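The `fnv_hash` call in the second hunk keys each bitmap for KV caching. A minimal Python sketch of 64-bit FNV-1a follows; the constants are the standard FNV-1a parameters and are assumed, not verified against `utils.hpp`, to match the server's helper:

```python
# Sketch of 64-bit FNV-1a over raw bitmap bytes, producing a string cache key.
def fnv1a_64(data: bytes) -> str:
    h = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis (assumed constant)
    for byte in data:
        h ^= byte                                      # xor in each input byte,
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF   # then multiply by the FNV prime, mod 2^64
    return str(h)

print(fnv1a_64(b"example bitmap bytes"))
```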
