Commit f329664

[UI] Add chatbot ui support
1 parent ed56d26 commit f329664

9 files changed: +283 -46 lines changed

README.md

Lines changed: 36 additions & 23 deletions
@@ -1,57 +1,67 @@
 # EmbeddedLLM
 
-Run local LLMs on iGPU and APU (AMD , Intel, and Qualcomm (Coming Soon))
+Run local LLMs on iGPU, APU and CPU (AMD , Intel, and Qualcomm (Coming Soon))
 
-|Support matrix| Supported now| Under Development | On the roadmap|
-|--------------|--------------|-------------------|---------------|
-|Model architectures| Gemma <br/> Llama * <br/> Mistral + <br/>Phi <br/>||
-|Platform| Linux <br/> Windows | ||||
-|Architecture|x86 <br/> x64 <br/> | Arm64 |||
-|Hardware Acceleration|CUDA<br/>DirectML<br/>|QNN <br/> ROCm |OpenVINO
+| Support matrix | Supported now | Under Development | On the roadmap |
+| --------------------- | --------------------------------------------------- | ----------------- | -------------- | --- | --- |
+| Model architectures | Gemma <br/> Llama \* <br/> Mistral + <br/>Phi <br/> | |
+| Platform | Linux <br/> Windows | | | | |
+| Architecture | x86 <br/> x64 <br/> | Arm64 | | |
+| Hardware Acceleration | CUDA<br/>DirectML<br/> | QNN <br/> ROCm | OpenVINO |
 
 \* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
 
 \+ The Mistral model architecture supports similar model families such as Zephyr.
 
-
-
 ## 🚀 Latest News
+
 - [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
 - [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
 
-
 ## Supported Models (Quick Start)
-| Models | Parameters | Context Length | Link |
-|---------------------|------------|----------------|------|
-|Gemma-2b-Instruct v1 | 2B | 8192 | [EmbeddedLLM/gemma-2b-it-onnx](https://huggingface.co/EmbeddedLLM/gemma-2b-it-onnx) |
-|Llama-2-7b-chat | 7B | 4096 | [EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml) |
-|Llama-2-13b-chat | 13B | 4096 | [EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml) |
-|Llama-3-8b-chat | 8B | 8192 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx) |
-|Mistral-7b-v0.3-instruct| 7B | 32768 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx)|
-| Phi3-mini-4k-instruct | 3.8B | 4096 | [microsoft/Phi-3-mini-4k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) |
-| Phi3-mini-128k-instruct | 3.8B | 128k | [microsoft/Phi-3-mini-128k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx) |
-| Phi3-medium-4k-instruct | 17B | 4096 | [microsoft/Phi-3-medium-4k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml)|
-| Phi3-medium-128k-instruct | 17B | 128k | [microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml)|
+
+| Models | Parameters | Context Length | Link |
+| --- | --- | --- | --- |
+| Gemma-2b-Instruct v1 | 2B | 8192 | [EmbeddedLLM/gemma-2b-it-onnx](https://huggingface.co/EmbeddedLLM/gemma-2b-it-onnx) |
+| Llama-2-7b-chat | 7B | 4096 | [EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml) |
+| Llama-2-13b-chat | 13B | 4096 | [EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml](https://huggingface.co/EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml) |
+| Llama-3-8b-chat | 8B | 8192 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx) |
+| Mistral-7b-v0.3-instruct | 7B | 32768 | [EmbeddedLLM/mistral-7b-instruct-v0.3-onnx](https://huggingface.co/EmbeddedLLM/mistral-7b-instruct-v0.3-onnx) |
+| Phi3-mini-4k-instruct | 3.8B | 4096 | [microsoft/Phi-3-mini-4k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx) |
+| Phi3-mini-128k-instruct | 3.8B | 128k | [microsoft/Phi-3-mini-128k-instruct-onnx](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx) |
+| Phi3-medium-4k-instruct | 17B | 4096 | [microsoft/Phi-3-medium-4k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml) |
+| Phi3-medium-128k-instruct | 17B | 128k | [microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml) |
 
 ## Getting Started
 
 ### Installation
 
 #### From Source
+
 **Windows**
+
 1. Install embeddedllm package. `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: currently support `cpu`, `directml` and `cuda`.
    - **DirectML:** `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
    - **CPU:** `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
   - **CUDA:** `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
+   - **With Web UI**:
+     - **DirectML:** `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml, webui]`
+     - **CPU:** `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu, webui]`
+     - **CUDA:** `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda, webui]`
 
 **Linux**
+
 1. Install embeddedllm package. `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: currently support `cpu`, `directml` and `cuda`.
    - **DirectML:** `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
    - **CPU:** `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
   - **CUDA:** `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
-
+   - **With Web UI**:
+     - **DirectML:** `ELLM_TARGET_DEVICE='directml' pip install -e .[directml, webui]`
+     - **CPU:** `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu, webui]`
+     - **CUDA:** `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda, webui]`
 
 ### Launch OpenAI API Compatible Server
+
 ```
 usage: ellm_server.exe [-h] [--port int] [--host str] [--response_role str] [--uvicorn_log_level str]
                        [--served_model_name str] [--model_path str] [--vision bool]

@@ -72,7 +82,10 @@ options:
 1. `ellm_server --model_path <path/to/model/weight>`.
 2. Example code to connect to the api server can be found in `scripts/python`.
 
+## Launch Chatbot Web UI
 
+1. `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`.
 
 ## Acknowledgements
-* Excellent open-source projects: [vLLM](https://github.com/vllm-project/vllm.git), [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai.git) and many others.
+
+- Excellent open-source projects: [vLLM](https://github.com/vllm-project/vllm.git), [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai.git) and many others.
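The new `## Launch Chatbot Web UI` section above documents the `ellm_chatbot` command, and later in this commit `setup.py` registers it as `embeddedllm.entrypoints.webui:main` behind a `webui` extra that pulls in `gradio~=4.36.1`. The `webui.py` module itself is not shown in this view, so the following is only a rough sketch of what such an entry point could look like: the flag names mirror the README command, the model name and `data: ` chunk parsing are borrowed from `scripts/python/httpx_client_stream.py`, and everything else (function names, defaults) is an assumption.

```python
# Hypothetical sketch of a Gradio chatbot wired to the OpenAI-compatible server.
# Flag names follow the README command; the webui module shipped in this commit
# may differ in detail.
import argparse
import json

import gradio as gr
import httpx


def build_chat_fn(server_host: str, server_port: int):
    url = f"http://{server_host}:{server_port}/v1/chat/completions"

    def chat(message, history):
        # Rebuild the OpenAI-style message list from the Gradio history pairs.
        messages = []
        for user_turn, bot_turn in history:
            messages.append({"role": "user", "content": user_turn})
            messages.append({"role": "assistant", "content": bot_turn})
        messages.append({"role": "user", "content": message})

        # Model name taken from scripts/python/httpx_client_stream.py in this commit.
        payload = {"messages": messages, "model": "phi3-mini-int4", "stream": True}
        partial = ""
        with httpx.stream("POST", url, json=payload, timeout=None) as response:
            for line in response.iter_lines():
                line = line.strip()
                if not line or "[DONE]" in line:
                    continue
                chunk = json.loads(line.replace("data: ", ""))
                delta = chunk["choices"][0]["delta"].get("content")
                if delta:
                    partial += delta
                    yield partial  # Gradio streams the growing reply.

    return chat


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=7788)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--server_port", type=int, required=True)
    parser.add_argument("--server_host", default="localhost")
    args = parser.parse_args()

    demo = gr.ChatInterface(build_chat_fn(args.server_host, args.server_port))
    demo.launch(server_name=args.host, server_port=args.port)


if __name__ == "__main__":
    main()
```

Usage then follows the README: start `ellm_server --model_path <path/to/model/weight>` first, then `ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`.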

requirements-cpu.txt

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 onnxruntime
-onnxruntime-genai
+onnxruntime-genai==0.3.0rc2

requirements-webui.txt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+gradio~=4.36.1

scripts/python/httpx_client_stream.py

Lines changed: 18 additions & 3 deletions
@@ -1,15 +1,30 @@
 import asyncio
 
 import httpx
+import json
 
+def parse_stream(stream:str):
+
+    stream = stream.replace('data: ', '')
+
+    response_obj = json.loads(stream)
+    # print(response_obj)
+
+    return response_obj
 
 
 async def stream_chat_completion(url: str, payload: dict):
     async with httpx.AsyncClient() as client:
         async with client.stream("POST", url, json=payload) as response:
             if response.status_code == 200:
                 async for data in response.aiter_bytes():
                     if data:
-                        print(data.decode("utf-8"))
+                        decodes_stream = data.decode("utf-8")
+                        if "[DONE]" in decodes_stream:
+                            continue
+                        resp = parse_stream(decodes_stream)
+                        if resp["choices"][0]["delta"].get('content', None):
+                            print(resp["choices"][0]["delta"]["content"], end='', flush=True)
+
                 # time.sleep(1)
             else:
                 print(f"Error: {response.status_code}")

@@ -20,9 +35,9 @@ async def stream_chat_completion(url: str, payload: dict):
 
 if __name__ == "__main__":
     url = "http://localhost:6979/v1/chat/completions"
     payload = {
-        "messages": [{"role": "user", "content": "Hello!"}],
+        "messages": [{"role": "user", "content": "What is the fastest bird on earth?"}],
         "model": "phi3-mini-int4",
-        "max_tokens": 80,
+        "max_tokens": 200,
         "temperature": 0.0,
         "stream": True,
     }
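The updated client above no longer dumps raw bytes: it strips the `data: ` server-sent-events prefix, JSON-decodes each chunk, skips the final `[DONE]` marker, and prints only the streamed delta text. The tail of the script is outside the hunks shown, so as a hedged usage sketch (assuming an `ellm_server` instance is already listening on `localhost:6979` and the script is importable from the working directory), the coroutine can be driven like this:

```python
# Sketch only: drives the streaming client shown in the diff above.
# Assumes httpx_client_stream.py sits in the current directory and that an
# ellm_server instance is listening on localhost:6979.
import asyncio

from httpx_client_stream import stream_chat_completion

payload = {
    "messages": [{"role": "user", "content": "What is the fastest bird on earth?"}],
    "model": "phi3-mini-int4",
    "max_tokens": 200,
    "temperature": 0.0,
    "stream": True,
}

asyncio.run(stream_chat_completion("http://localhost:6979/v1/chat/completions", payload))
```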

setup.py

Lines changed: 2 additions & 0 deletions
@@ -130,6 +130,7 @@ def get_ellm_version() -> str:
     # Add other metadata and dependencies as needed
     extras_require={
         "lint": _read_requirements("requirements-lint.txt"),
+        "webui": _read_requirements("requirements-webui.txt"),
         "cuda": ["onnxruntime-genai-cuda==0.3.0rc2"],
     },
     dependency_links=[

@@ -138,6 +139,7 @@ def get_ellm_version() -> str:
     entry_points={
         "console_scripts": [
             "ellm_server=embeddedllm.entrypoints.api_server:main",
+            "ellm_chatbot=embeddedllm.entrypoints.webui:main",
         ],
     },
 )
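The new `webui` extra reuses the `_read_requirements` helper that these hunks reference but do not show. Presumably it just returns the dependency lines from the named file (here, `gradio~=4.36.1` from `requirements-webui.txt`); a minimal sketch of such a helper, purely as an assumption about the existing code:

```python
# Hedged sketch of a _read_requirements-style helper; the real implementation in
# setup.py is not part of the hunks shown above.
import os

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))


def _read_requirements(filename: str) -> list[str]:
    """Return non-empty, non-comment lines from a requirements file."""
    with open(os.path.join(ROOT_DIR, filename)) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]
```

With that in place, `pip install -e .[cuda, webui]` (or the DirectML/CPU variants from the README) pulls in Gradio and exposes both the `ellm_server` and `ellm_chatbot` console scripts.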

src/embeddedllm/entrypoints/api_server.py

Lines changed: 1 addition & 1 deletion
@@ -3,8 +3,8 @@
 from fastapi import FastAPI, Request
 from fastapi.exceptions import RequestValidationError
 from fastapi.responses import JSONResponse, Response, StreamingResponse
-from pydantic_settings import BaseSettings, SettingsConfigDict
 from pydantic import Field
+from pydantic_settings import BaseSettings, SettingsConfigDict
 
 from embeddedllm.entrypoints.chat_server import OpenAPIChatServer
 from embeddedllm.protocol import (  # noqa: E501

src/embeddedllm/entrypoints/chat_server.py

Lines changed: 1 addition & 8 deletions
@@ -26,29 +26,22 @@
 
 from embeddedllm.engine import EmbeddedLLMEngine
 from embeddedllm.inputs import ImagePixelData, PromptInputs
-from embeddedllm.protocol import (  # noqa: E501
+from embeddedllm.protocol import (  # noqa: E501; ChatCompletionLogProb,; ChatCompletionLogProbs,; ChatCompletionLogProbsContent,; ChatCompletionNamedToolChoiceParam,; CompletionOutput,; FunctionCall,; ToolCall,
     ChatCompletionContentPartParam,
-    # ChatCompletionLogProb,
-    # ChatCompletionLogProbs,
-    # ChatCompletionLogProbsContent,
     ChatCompletionMessageParam,
-    # ChatCompletionNamedToolChoiceParam,
     ChatCompletionRequest,
     ChatCompletionResponse,
     ChatCompletionResponseChoice,
     ChatCompletionResponseStreamChoice,
     ChatCompletionStreamResponse,
     ChatMessage,
-    # CompletionOutput,
     CompletionRequest,
     DeltaMessage,
     ErrorResponse,
-    # FunctionCall,
     ModelCard,
     ModelList,
     ModelPermission,
     RequestOutput,
-    # ToolCall,
     UsageInfo,
 )
 from embeddedllm.utils import decode_base64, random_uuid
