
@ngxson (Owner) commented Nov 10, 2025

NOTE:

  • This is a very hacky PoC, only tested on macOS
  • No API yet for downloading a model
  • No API yet for unloading a model
  • No streaming support yet

To download a model to the local cache:

llama-cli -hf ggml-org/gemma-3-4b-it-GGUF:latest

Then, start the server:

llama-server      # note: do not specify -m

API:

List the cached models:

GET http://localhost:8080/models

{
    "models": [
        {
            "model": "ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M",
            "loaded": false
        },
        {
            "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
            "loaded": false
        }
    ]
}
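
The same request with curl, assuming the server is running on the default port 8080 as above:

curl http://localhost:8080/models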

Load a model:

POST: http://localhost:8080/models/load
body: { "model": "ggml-org/gemma-3-4b-it-GGUF:latest" }
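
An equivalent curl sketch for the same endpoint and body:

curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/gemma-3-4b-it-GGUF:latest"}'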

Then, run a completion:

POST: http://localhost:8080/v1/chat/completions
body:
{
  "model": "ggml-org/gemma-3-4b-it-GGUF:latest",
  "messages": [
    {
      "role": "user",
      "content": "who are you"
    }
  ],
  "stream": false,
  "max_tokens": 16
}
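
An equivalent curl sketch for the completion request:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ggml-org/gemma-3-4b-it-GGUF:latest",
        "messages": [{"role": "user", "content": "who are you"}],
        "stream": false,
        "max_tokens": 16
      }'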

@coderabbitai bot commented Nov 10, 2025

Review skipped: draft detected.

