Commit f329664

[UI] Add chatbot ui support
1 parent ed56d26 commit f329664

9 files changed: +283 -46 lines changed

README.md

Lines changed: 36 additions & 23 deletions
@@ -1,57 +1,67 @@
 # EmbeddedLLM
 
-Run local LLMs on iGPU and APU (AMD , Intel, and Qualcomm (Coming Soon))
+Run local LLMs on iGPU, APU and CPU (AMD , Intel, and Qualcomm (Coming Soon))
 
-|Support matrix| Supported now| Under Development | On the roadmap|
-|--------------|--------------|-------------------|---------------|
-|Model architectures| Gemma <br/> Llama * <br/> Mistral + <br/>Phi <br/>||
-|Platform| Linux <br/> Windows | ||||
-|Architecture|x86 <br/> x64 <br/> | Arm64 |||
-|Hardware Acceleration|CUDA<br/>DirectML<br/>|QNN <br/> ROCm |OpenVINO
+| Support matrix | Supported now | Under Development | On the roadmap |
+| --------------------- | --------------------------------------------------- | ----------------- | -------------- | --- | --- |
+| Model architectures | Gemma <br/> Llama \* <br/> Mistral + <br/>Phi <br/> | |
+| Platform | Linux <br/> Windows | | | | |
+| Architecture | x86 <br/> x64 <br/> | Arm64 | | |
+| Hardware Acceleration | CUDA<br/>DirectML<br/> | QNN <br/> ROCm | OpenVINO |
 
 \* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
 
 \+ The Mistral model architecture supports similar model families such as Zephyr.
 
-
-
 ## 🚀 Latest News
+
 - [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
 - [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
 
-
 ## Supported Models (Quick Start)
-| Models | Parameters | Context Length | Link |
-|---------------------|------------|----------------|------|
-|Gemma-2b-Instruct v1 | 2B | 8192 | [EmbeddedLLM/gemma-2b-it-onnx](https://huggingface.co/EmbeddedLLM/gemma-2b-it-onnx) |
-|Llama-2-7b-chat | 7B | 4096 | [EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml) |
-|Llama-2-13b-chat | 13B | 4096 | [EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml) |
-|Llama-3-8b-chat | 8B | 8192 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx) |
-|Mistral-7b-v0.3-instruct| 7B | 32768 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx)|
-| Phi3-mini-4k-instruct | 3.8B | 4096 | [microsoft/Phi-3-mini-4k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) |
-| Phi3-mini-128k-instruct | 3.8B | 128k | [microsoft/Phi-3-mini-128k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx) |
-| Phi3-medium-4k-instruct | 17B | 4096 | [microsoft/Phi-3-medium-4k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml)|
-| Phi3-medium-128k-instruct | 17B | 128k | [microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml)|
+
+| Models | Parameters | Context Length | Link |
+| --- | --- | --- | --- |
+| Gemma-2b-Instruct v1 | 2B | 8192 | [EmbeddedLLM/gemma-2b-it-onnx](https://huggingface.co/EmbeddedLLM/gemma-2b-it-onnx) |
+| Llama-2-7b-chat | 7B | 4096 | [EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml) |
+| Llama-2-13b-chat | 13B | 4096 | [EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml) |
+| Llama-3-8b-chat | 8B | 8192 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx) |
+| Mistral-7b-v0.3-instruct | 7B | 32768 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx) |
+| Phi3-mini-4k-instruct | 3.8B | 4096 | [microsoft/Phi-3-mini-4k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) |
+| Phi3-mini-128k-instruct | 3.8B | 128k | [microsoft/Phi-3-mini-128k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx) |
+| Phi3-medium-4k-instruct | 17B | 4096 | [microsoft/Phi-3-medium-4k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml) |
+| Phi3-medium-128k-instruct | 17B | 128k | [microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml) |
 
 ## Getting Started
 
 ### Installation
 
 #### From Source
+
 **Windows**
+
 1. Install embeddedllm package. `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: currently support `cpu`, `directml` and `cuda`.
    - **DirectML:** `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
    - **CPU:** `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
   - **CUDA:** `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
+   - **With Web UI**:
+     - **DirectML:** `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml, webui]`
+     - **CPU:** `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu, webui]`
+     - **CUDA:** `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda, webui]`
 
 **Linux**
+
 1. Install embeddedllm package. `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: currently support `cpu`, `directml` and `cuda`.
    - **DirectML:** `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
    - **CPU:** `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
   - **CUDA:** `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
-
+   - **With Web UI**:
+     - **DirectML:** `ELLM_TARGET_DEVICE='directml' pip install -e .[directml, webui]`
+     - **CPU:** `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu, webui]`
+     - **CUDA:** `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda, webui]`
 
 ### Launch OpenAI API Compatible Server
+
 ```
 usage: ellm_server.exe [-h] [--port int] [--host str] [--response_role str] [--uvicorn_log_level str]
                        [--served_model_name str] [--model_path str] [--vision bool]

@@ -72,7 +82,10 @@ options:
 1. `ellm_server --model_path <path/to/model/weight>`.
 2. Example code to connect to the api server can be found in `scripts/python`.
 
+## Launch Chatbot Web UI
 
+1. `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`.
 
 ## Acknowledgements
-* Excellent open-source projects: [vLLM](https://github.com/vllm-project/vllm.git), [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai.git) and many others.
+
+- Excellent open-source projects: [vLLM](https://github.com/vllm-project/vllm.git), [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai.git) and many others.
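The new `## Launch Chatbot Web UI` section above documents the `ellm_chatbot` command, and later in this commit `setup.py` registers it as `embeddedllm.entrypoints.webui:main` behind a `webui` extra that pulls in `gradio~=4.36.1`. The `webui.py` module itself is not shown in this view, so the following is only a rough sketch of what such an entry point could look like: the flag names mirror the README command, the model name and `data: ` chunk parsing are borrowed from `scripts/python/httpx_client_stream.py`, and everything else (function names, defaults) is an assumption.

```python
# Hypothetical sketch of a Gradio chatbot wired to the OpenAI-compatible server.
# Flag names follow the README command; the webui module shipped in this commit
# may differ in detail.
import argparse
import json

import gradio as gr
import httpx


def build_chat_fn(server_host: str, server_port: int):
    url = f"http://{server_host}:{server_port}/v1/chat/completions"

    def chat(message, history):
        # Rebuild the OpenAI-style message list from the Gradio history pairs.
        messages = []
        for user_turn, bot_turn in history:
            messages.append({"role": "user", "content": user_turn})
            messages.append({"role": "assistant", "content": bot_turn})
        messages.append({"role": "user", "content": message})

        # Model name taken from scripts/python/httpx_client_stream.py in this commit.
        payload = {"messages": messages, "model": "phi3-mini-int4", "stream": True}
        partial = ""
        with httpx.stream("POST", url, json=payload, timeout=None) as response:
            for line in response.iter_lines():
                line = line.strip()
                if not line or "[DONE]" in line:
                    continue
                chunk = json.loads(line.replace("data: ", ""))
                delta = chunk["choices"][0]["delta"].get("content")
                if delta:
                    partial += delta
                    yield partial  # Gradio streams the growing reply.

    return chat


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=7788)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--server_port", type=int, required=True)
    parser.add_argument("--server_host", default="localhost")
    args = parser.parse_args()

    demo = gr.ChatInterface(build_chat_fn(args.server_host, args.server_port))
    demo.launch(server_name=args.host, server_port=args.port)


if __name__ == "__main__":
    main()
```

Usage then follows the README: start `ellm_server --model_path <path/to/model/weight>` first, then `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`.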

requirements-cpu.txt

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 onnxruntime
-onnxruntime-genai
+onnxruntime-genai==0.3.0rc2

requirements-webui.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+gradio~=4.36.1

scripts/python/httpx_client_stream.py

Lines changed: 18 additions & 3 deletions
@@ -1,15 +1,30 @@
 import asyncio
 
 import httpx
+import json
 
+def parse_stream(stream:str):
+
+    stream = stream.replace('data: ', '')
+
+    response_obj = json.loads(stream)
+    # print(response_obj)
+
+    return response_obj
 
 
 async def stream_chat_completion(url: str, payload: dict):
     async with httpx.AsyncClient() as client:
         async with client.stream("POST", url, json=payload) as response:
             if response.status_code == 200:
                 async for data in response.aiter_bytes():
                     if data:
-                        print(data.decode("utf-8"))
+                        decodes_stream = data.decode("utf-8")
+                        if "[DONE]" in decodes_stream:
+                            continue
+                        resp = parse_stream(decodes_stream)
+                        if resp["choices"][0]["delta"].get('content', None):
+                            print(resp["choices"][0]["delta"]["content"], end='', flush=True)
+
                 # time.sleep(1)
             else:
                 print(f"Error: {response.status_code}")

@@ -20,9 +35,9 @@ async def stream_chat_completion(url: str, payload: dict):
 
 if __name__ == "__main__":
     url = "http://localhost:6979/v1/chat/completions"
     payload = {
-        "messages": [{"role": "user", "content": "Hello!"}],
+        "messages": [{"role": "user", "content": "What is the fastest bird on earth?"}],
         "model": "phi3-mini-int4",
-        "max_tokens": 80,
+        "max_tokens": 200,
         "temperature": 0.0,
         "stream": True,
     }
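The updated client above no longer dumps raw bytes: it strips the `data: ` server-sent-events prefix, JSON-decodes each chunk, skips the final `[DONE]` marker, and prints only the streamed delta text. The tail of the script is outside the hunks shown, so as a hedged usage sketch (assuming an `ellm_server` instance is already listening on `localhost:6979` and the script is importable from the working directory), the coroutine can be driven like this:

```python
# Sketch only: drives the streaming client shown in the diff above.
# Assumes httpx_client_stream.py sits in the current directory and that an
# ellm_server instance is listening on localhost:6979.
import asyncio

from httpx_client_stream import stream_chat_completion

payload = {
    "messages": [{"role": "user", "content": "What is the fastest bird on earth?"}],
    "model": "phi3-mini-int4",
    "max_tokens": 200,
    "temperature": 0.0,
    "stream": True,
}

asyncio.run(stream_chat_completion("http://localhost:6979/v1/chat/completions", payload))
```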

setup.py

Lines changed: 2 additions & 0 deletions
@@ -130,6 +130,7 @@ def get_ellm_version() -> str:
     # Add other metadata and dependencies as needed
     extras_require={
         "lint": _read_requirements("requirements-lint.txt"),
+        "webui": _read_requirements("requirements-webui.txt"),
         "cuda": ["onnxruntime-genai-cuda==0.3.0rc2"],
     },
     dependency_links=[

@@ -138,6 +139,7 @@ def get_ellm_version() -> str:
     entry_points={
         "console_scripts": [
             "ellm_server=embeddedllm.entrypoints.api_server:main",
+            "ellm_chatbot=embeddedllm.entrypoints.webui:main",
         ],
     },
 )
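The new `webui` extra reuses the `_read_requirements` helper that these hunks reference but do not show. Presumably it just returns the dependency lines from the named file (here, `gradio~=4.36.1` from `requirements-webui.txt`); a minimal sketch of such a helper, purely as an assumption about the existing code:

```python
# Hedged sketch of a _read_requirements-style helper; the real implementation in
# setup.py is not part of the hunks shown above.
import os

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))


def _read_requirements(filename: str) -> list[str]:
    """Return non-empty, non-comment lines from a requirements file."""
    with open(os.path.join(ROOT_DIR, filename)) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]
```

With that in place, `pip install -e .[cuda, webui]` (or the DirectML/CPU variants from the README) pulls in Gradio and exposes both the `ellm_server` and `ellm_chatbot` console scripts.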

src/embeddedllm/entrypoints/api_server.py

Lines changed: 1 addition & 1 deletion
@@ -3,8 +3,8 @@
 from fastapi import FastAPI, Request
 from fastapi.exceptions import RequestValidationError
 from fastapi.responses import JSONResponse, Response, StreamingResponse
-from pydantic_settings import BaseSettings, SettingsConfigDict
 from pydantic import Field
+from pydantic_settings import BaseSettings, SettingsConfigDict
 
 from embeddedllm.entrypoints.chat_server import OpenAPIChatServer
 from embeddedllm.protocol import (  # noqa: E501

src/embeddedllm/entrypoints/chat_server.py

Lines changed: 1 addition & 8 deletions
@@ -26,29 +26,22 @@
 
 from embeddedllm.engine import EmbeddedLLMEngine
 from embeddedllm.inputs import ImagePixelData, PromptInputs
-from embeddedllm.protocol import (  # noqa: E501
+from embeddedllm.protocol import (  # noqa: E501; ChatCompletionLogProb,; ChatCompletionLogProbs,; ChatCompletionLogProbsContent,; ChatCompletionNamedToolChoiceParam,; CompletionOutput,; FunctionCall,; ToolCall,
     ChatCompletionContentPartParam,
-    # ChatCompletionLogProb,
-    # ChatCompletionLogProbs,
-    # ChatCompletionLogProbsContent,
     ChatCompletionMessageParam,
-    # ChatCompletionNamedToolChoiceParam,
     ChatCompletionRequest,
     ChatCompletionResponse,
     ChatCompletionResponseChoice,
     ChatCompletionResponseStreamChoice,
     ChatCompletionStreamResponse,
     ChatMessage,
-    # CompletionOutput,
     CompletionRequest,
     DeltaMessage,
     ErrorResponse,
-    # FunctionCall,
     ModelCard,
     ModelList,
     ModelPermission,
     RequestOutput,
-    # ToolCall,
     UsageInfo,
 )
 from embeddedllm.utils import decode_base64, random_uuid
