Commit adcbef5: Update README.md
Parent: 1e37d0e
File tree: 12 files changed, +98 −58 lines

README.md

Lines changed: 43 additions & 4 deletions
@@ -16,7 +16,8 @@ Run local LLMs on iGPU and APU (AMD , Intel, and Qualcomm (Coming Soon))
 
 
 ## 🚀 Latest News
-- [2024/06] Support chat inference on iGPU, APU and CPU.
+- [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
+- [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
 
 
 ## Supported Models (Quick Start)

@@ -32,8 +33,46 @@ Run local LLMs on iGPU and APU (AMD , Intel, and Qualcomm (Coming Soon))
 | Phi3-medium-4k-instruct | 17B | 4096 | [microsoft/Phi-3-medium-4k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct-onnx-directml)|
 | Phi3-medium-128k-instruct | 17B | 128k | [microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml)|
 
+## Getting Started
 
-## Acknowledgements
-* Excellent open-source projects: [vLLM](https://github.com/vllm-project/vllm.git), [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai.git) and many others.
+### Installation
+
+#### From Source
+**Windows**
+1. Install embeddedllm package. `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: currently support `cpu`, `directml` and `cuda`.
+   - **DirectML:** `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
+   - **CPU:** `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
+   - **CUDA:** `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
+
+**Linux**
+1. Install embeddedllm package. `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: currently support `cpu`, `directml` and `cuda`.
+   - **DirectML:** `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
+   - **CPU:** `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
+   - **CUDA:** `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
+
+
+### Launch OpenAI API Compatible Server
+```
+usage: ellm_server.exe [-h] [--port int] [--host str] [--response_role str] [--uvicorn_log_level str]
+                       [--served_model_name str] [--model_path str] [--vision bool]
 
-* Thanks to all the [contributors](./docs/contributors.md).
+options:
+  -h, --help            show this help message and exit
+  --port int            Server port. (default: 6979)
+  --host str            Server host. (default: 0.0.0.0)
+  --response_role str   Server response role. (default: assistant)
+  --uvicorn_log_level str
+                        Uvicorn logging level. `debug`, `info`, `trace`, `warning`, `critical` (default: info)
+  --served_model_name str
+                        Model name. (default: phi3-mini-int4)
+  --model_path str      Path to model weights. (required)
+  --vision bool         Enable vision capability, only if model supports vision input. (default: False)
+```
+
+1. `ellm_server --model_path <path/to/model/weight>`.
+2. Example code to connect to the api server can be found in `scripts/python`.
+
+
+
+## Acknowledgements
+* Excellent open-source projects: [vLLM](https://github.com/vllm-project/vllm.git), [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai.git) and many others.
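
The new README points readers to `scripts/python` for client code. For context, here is a minimal client sketch, not taken from the repo, that uses the `openai` package (already listed in requirements-common.txt) against the defaults from the usage text above: port 6979, served model name `phi3-mini-int4`, and the `/v1/chat/completions` route added in api_server.py below. The placeholder API key is an assumption; adjust it if your deployment checks one.

```
# Hypothetical client sketch: chat with a local ellm_server instance
# started as `ellm_server --model_path <path/to/model/weight>`.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:6979/v1",  # default host/port from the usage text above
    api_key="EMPTY",                      # assumed placeholder value
)

response = client.chat.completions.create(
    model="phi3-mini-int4",  # default served_model_name
    messages=[{"role": "user", "content": "Give me one sentence about on-device LLMs."}],
    stream=False,
)
print(response.choices[0].message.content)
```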

requirements-common.txt

Lines changed: 3 additions & 2 deletions
@@ -3,8 +3,9 @@ fastapi~=0.110.0
 gunicorn~=21.2.0
 loguru~=0.7.2
 numpy~=1.26.4
-pydantic-settings>=2.2.1
-pydantic~=2.6.3
+pydantic-settings>=2.3.3
+pydantic-core~=2.18.4
+pydantic~=2.7.4
 loguru
 openai
 torch

requirements-cuda.txt

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 onnxruntime-gpu~=1.18.0
-onnxruntime-genai-cuda~=0.2.0
+onnxruntime-genai-cuda~=0.3.0rc2

scripts/benchmark/benchmark_api_server.py

Whitespace-only changes.

scripts/python/httpx_client_vision.py

Lines changed: 0 additions & 1 deletion
@@ -20,7 +20,6 @@ def chat_completion(url: str, payload: dict):
 
 # Example usage
 if __name__ == "__main__":
-
     current_file_path = os.path.abspath(__file__)
     IMAGE_PATH = os.path.join(os.path.dirname(current_file_path), "..", "images", "catdog.png")
 

scripts/python/httpx_client_vision_stream.py

Lines changed: 0 additions & 1 deletion
@@ -20,7 +20,6 @@ async def stream_chat_completion(url: str, payload: dict):
 
 # Example usage
 if __name__ == "__main__":
-
     current_file_path = os.path.abspath(__file__)
     IMAGE_PATH = os.path.join(os.path.dirname(current_file_path), "..", "images", "catdog.png")
 

scripts/python/test_prompt_template.py

Lines changed: 0 additions & 16 deletions
This file was deleted.

setup.py

Lines changed: 24 additions & 9 deletions
@@ -2,21 +2,21 @@
 import os
 import re
 from typing import List
+import platform
 
 from setuptools import find_packages, setup
 
 ROOT_DIR = os.path.dirname(__file__)
 
-# # Custom function to check for DirectML support
-# def check_directml_support():
-#     if platform.system() != "Windows":
-#         raise RuntimeError("This package requires a Windows system with DirectML support.")
-#     # Add additional checks for DirectML support if necessary
 
-# # Run the check before proceeding with the setup
-# check_directml_support()
+ELLM_TARGET_DEVICE = os.environ.get("ELLM_TARGET_DEVICE", "directml")
 
-ELLM_TARGET_DEVICE = "cuda"
+
+# Custom function to check for DirectML support
+def check_directml_support():
+    if platform.system() != "Windows":
+        raise RuntimeError("This package requires a Windows system with DirectML support.")
+    # Add additional checks for DirectML support if necessary
 
 
 def read_readme() -> str:

@@ -29,6 +29,8 @@ def read_readme() -> str:
 
 
 def _is_directml() -> bool:
+    # Run the check before proceeding with the setup
+    check_directml_support()
     return ELLM_TARGET_DEVICE == "directml"
 
 

@@ -97,6 +99,8 @@ def get_ellm_version() -> str:
     return version
 
 
+print(get_requirements().extend(_read_requirements("requirements-common.txt")))
+
 setup(
     name="embeddedllm",
     version=get_ellm_version(),

@@ -120,9 +124,20 @@ def get_ellm_version() -> str:
         "License :: OSI Approved :: Apache Software License",
         "Topic :: Scientific/Engineering :: Artificial Intelligence",
     ],
-    install_requires=get_requirements().extend(_read_requirements("requirements-common.txt")),
+    install_requires=get_requirements()
+    + _read_requirements("requirements-common.txt")
+    + _read_requirements("requirements-build.txt"),
     # Add other metadata and dependencies as needed
     extras_require={
        "lint": _read_requirements("requirements-lint.txt"),
+        "cuda": ["onnxruntime-genai-cuda==0.3.0rc2"],
+    },
+    dependency_links=[
+        "https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/"
+    ],
+    entry_points={
+        "console_scripts": [
+            "ellm_server=embeddedllm.entrypoints.api_server:main",
+        ],
     },
 )
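
One detail worth calling out in the install_requires change: `list.extend()` mutates its receiver and returns `None`, so the old expression handed `None` to setuptools, while the new `+` concatenation builds the combined requirements list. A standalone illustration, not repo code:

```
# list.extend() vs. list concatenation, the root of the install_requires fix.
device_reqs = ["onnxruntime-gpu~=1.18.0"]
common_reqs = ["fastapi~=0.110.0"]

combined = device_reqs + common_reqs       # new list: what setup() needs for install_requires
broken = device_reqs.extend(common_reqs)   # extend() mutates device_reqs and returns None
print(combined)  # ['onnxruntime-gpu~=1.18.0', 'fastapi~=0.110.0']
print(broken)    # None -- what the old install_requires evaluated to
```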

src/embeddedllm/engine.py

Lines changed: 0 additions & 3 deletions
@@ -36,7 +36,6 @@ def onnx_generator_context(model, params):
 
 
 class EmbeddedLLMEngine:
-
    def __init__(self, model_path: str, vision: bool):
        self.model_path = model_path
        self.model_config = AutoConfig.from_pretrained(self.model_path, trust_remote_code=True)

@@ -103,10 +102,8 @@ async def generate_vision(
        request_id: str,
        stream: bool = True,
    ) -> AsyncIterator[RequestOutput]:
-
        prompt_text = inputs["prompt"]
        # print(f"inputs: {str(inputs)}")
-        print(inputs.keys())
        input_tokens = self.onnx_tokenizer.encode(prompt_text)
        # logger.debug(f"inputs: {str(inputs)}")
        # logger.debug(f'inputs["multi_model_data"]: {str(inputs.multi_model_data)}')

src/embeddedllm/entrypoints/api_server.py

Lines changed: 22 additions & 12 deletions
@@ -4,6 +4,7 @@
 from fastapi.exceptions import RequestValidationError
 from fastapi.responses import JSONResponse, Response, StreamingResponse
 from pydantic_settings import BaseSettings, SettingsConfigDict
+from pydantic import Field
 
 from embeddedllm.entrypoints.chat_server import OpenAPIChatServer
 from embeddedllm.protocol import (  # noqa: E501

@@ -18,14 +19,21 @@
 
 
 class Config(BaseSettings):
-    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8", extra="ignore")
-    port: int = 6979
-    host: str = "0.0.0.0"
-    response_role: str = "assistant"
-    uvicorn_log_level: str = "info"
-    served_model_name: str = "phi3-mini-int4"
-    model_path: str = None
-    vision: bool = False
+    model_config = SettingsConfigDict(
+        env_file=".env", env_file_encoding="utf-8", extra="ignore", cli_parse_args=True
+    )
+    port: int = Field(default=6979, description="Server port.")
+    host: str = Field(default="0.0.0.0", description="Server host.")
+    response_role: str = Field(default="assistant", description="Server response role.")
+    uvicorn_log_level: str = Field(
+        default="info",
+        description="Uvicorn logging level. `debug`, `info`, `trace`, `warning`, `critical`",
+    )
+    served_model_name: str = Field(default="phi3-mini-int4", description="Model name.")
+    model_path: str = Field(description="Path to model weights.")
+    vision: bool = Field(
+        default=False, description="Enable vision capability, only if model supports vision input."
+    )
 
 
 config = Config()

@@ -52,20 +60,18 @@ async def show_available_models():
 
 @app.post("/v1/chat/completions")
 async def create_chat_completion(request: ChatCompletionRequest, raw_request: Request):
-
     generator = await openai_chat_server.create_chat_completion(request, raw_request)
     if isinstance(generator, ErrorResponse):
         return JSONResponse(content=generator.model_dump(), status_code=generator.code)
     if request.stream:
         return StreamingResponse(content=generator, media_type="text/event-stream")
     else:
-        # return JSONResponse(content="Non-streaming Chat Generation Yet to be Implemented.",
-        #                     status_code=404)
         assert isinstance(generator, ChatCompletionResponse)
         return JSONResponse(content=generator.model_dump())
 
 
-if __name__ == "__main__":
+def main():
+    global openai_chat_server
     import os
 
     import uvicorn

@@ -90,3 +96,7 @@ async def create_chat_completion(request: ChatCompletionRequest, raw_request: Re
     uvicorn.run(
         app, host=config.host, port=config.port, log_level=config.uvicorn_log_level, loop="asyncio"
     )
+
+
+if __name__ == "__main__":
+    main()
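
For readers unfamiliar with the pattern above: with `cli_parse_args=True`, pydantic-settings builds the command-line interface shown in the README's usage text directly from the `Field` declarations, with environment variables and the `.env` file as fallbacks. A stripped-down, standalone sketch of that behavior follows; the file name and fields are illustrative, not the repo's module.

```
# sketch_settings.py -- minimal pydantic-settings CLI example (assumed standalone script).
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class Config(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore", cli_parse_args=True)

    port: int = Field(default=6979, description="Server port.")
    model_path: str = Field(description="Path to model weights.")  # required: no default


if __name__ == "__main__":
    # e.g. `python sketch_settings.py --model_path ./phi3-mini --port 8080`
    cfg = Config()  # parses sys.argv, then environment variables, then .env
    print(cfg.port, cfg.model_path)
```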
