Commit 056cf3a

Merge branch 'oracle-samples:main' into main
2 parents 00f0f6b + aa68c9b commit 056cf3a

File tree

7 files changed: +620 -102 lines changed

LLM/deploy-openai-llm-byoc.md

Lines changed: 265 additions & 0 deletions

# Deploy OpenAI open-source models

This guide demonstrates how to deploy an open-source model and perform inference with the OCI Data Science service. In this example, we use a model downloaded from Hugging Face, specifically [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) from OpenAI.

## Required IAM Policies

Add these [policies](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/model-deployment/containers/llama2#required-iam-policies) to grant access to OCI services.

## Setup

Create a Data Science notebook session with at least 400 GB of storage. We will use the notebook session to:

1. Download the model weights
2. Create a Model Catalog entry
3. Deploy the model

To prepare the inference container, we will use a local laptop, since this step requires Docker commands; the notebook session does not come with Docker tooling.

# Prepare the Inference Container

vLLM is an easy-to-use library for LLM inference and serving. You can get the container image from [DockerHub](https://hub.docker.com/r/vllm/vllm-openai/tags).

Run the following commands on your laptop:

```shell
docker pull --platform linux/amd64 vllm/vllm-openai:gptoss
```

Currently, OCI Data Science Model Deployment only supports container images residing in the OCI Registry. Before you can push the pulled vLLM container, make sure you have created a repository in your tenancy, either in the Console (steps below) or with the CLI command shown after this list:

- Go to your tenancy's Container Registry
- Click the Create repository button
- Select Private under Access types
- Set the Repository name. We use "vllm-odsc" in this example
- Click the Create button

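If you prefer the CLI, you can create the repository with the OCI CLI instead of the Console. This is a minimal sketch; `<compartment-ocid>` is a placeholder for the compartment that should hold the repository, and the repository name matches the example above:

```shell
# Create a private container repository named "vllm-odsc" (replace <compartment-ocid>)
oci artifacts container repository create \
    --compartment-id <compartment-ocid> \
    --display-name vllm-odsc \
    --is-public false
```
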
You may need to `docker login` to the Oracle Cloud Container Registry (OCIR) before you can push the image, if you haven't done so already. To log in, use the API Auth Token that can be created under your Oracle Cloud account (Profile -> Auth Tokens). You only need to log in once. Replace `<region>` with the OCI region you are using.

```shell
docker login -u '<tenancy-namespace>/<username>' <region>.ocir.io
```

If your tenancy is federated with Oracle Identity Cloud Service, use the format `<tenancy-namespace>/oracleidentitycloudservice/<username>`. You can then tag and push the container image to the OCI Registry:

```shell
docker tag vllm/vllm-openai:gptoss <region>.ocir.io/<tenancy>/vllm-odsc/vllm-openai:gptoss
docker push <region>.ocir.io/<tenancy>/vllm-odsc/vllm-openai:gptoss
```

# Deployment

Perform the following steps in an OCI notebook session.

## Prepare the Model Artifacts

To prepare the model artifacts for LLM model deployment:

- Download the model files from Hugging Face to a local directory.
- Upload the model folder to a [versioned bucket](https://docs.oracle.com/en-us/iaas/Content/Object/Tasks/usingversioning.htm) in Oracle Object Storage. If you don't have an Object Storage bucket, create one using the OCI SDK or the Console. Make a note of the `namespace`, `compartment`, and `bucketname`. Configure the policies to allow the Data Science service to read and write the model artifact to the Object Storage bucket in your tenancy. An administrator must configure the policies in IAM in the Console.
- Create a Model Catalog entry for the model using the Object Storage path.

### Model Download from HuggingFace Model Hub

[This documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-an-entire-repository) provides more information on using `huggingface-cli` to download an entire repository at a given revision. Models in the HuggingFace Hub are stored in their own repository.

```shell
# Select the model that you want to deploy.
huggingface-cli download openai/gpt-oss-120b --local-dir models/gpt-oss-120b --exclude "metal/*"
```

Download the tiktoken encoding file:

```shell
wget -P models/gpt-oss-120b https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
```

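Optionally, sanity-check that the model weights and the tiktoken file are in place before uploading (a quick check; the exact file sizes depend on the snapshot you downloaded):

```shell
# List the downloaded files and report the total size of the local model directory
ls -lh models/gpt-oss-120b
du -sh models/gpt-oss-120b
```
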
## Upload Model to OCI Object Storage

**Note**: The bucket must be a versioned bucket.

```shell
oci os object bulk-upload --src-dir models/gpt-oss-120b --prefix gpt-oss-120b/ -bn <bucket_name> -ns <bucket_namespace> --auth "resource_principal"
```

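You can verify the upload by listing the objects under the prefix; this sketch uses the same bucket and namespace placeholders as the command above:

```shell
# List the uploaded objects under the gpt-oss-120b/ prefix
oci os object list -bn <bucket_name> -ns <bucket_namespace> --prefix gpt-oss-120b/ --auth "resource_principal"
```
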
## Create Model by Reference using ADS

```python
# Uncomment this code and set the correct proxy links if you have to set up a proxy for internet access
# import os
# os.environ['http_proxy'] = "http://myproxy"
# os.environ['https_proxy'] = "http://myproxy"

# Use os.environ['no_proxy'] to route traffic directly
```

```python
import ads
import os

ads.set_auth("resource_principal")

# Extract region information from the Notebook environment variables and signer.
ads.common.utils.extract_region()
```

```python
# Change as required for your environment
compartment_id = os.environ["PROJECT_COMPARTMENT_OCID"]
project_id = os.environ["PROJECT_OCID"]

log_group_id = "ocid1.loggroup.oc1.xxx.xxxxx"
log_id = "ocid1.log.oc1.xxx.xxxxx"

instance_shape = "BM.GPU.H100.8"

region = ads.common.utils.extract_region()
```

```python
from ads.model.datascience_model import DataScienceModel

bucket = "<bucket-name>"  # replace with your bucket name
namespace = "<namespace>"  # replace with your Object Storage namespace

artifact_path = f"oci://{bucket}@{namespace}/gpt-oss-120b"

model = (
    DataScienceModel()
    .with_compartment_id(compartment_id)
    .with_project_id(project_id)
    .with_display_name("gpt-oss-120b")
    .with_artifact(artifact_path)
)

model.create(model_by_reference=True)
```

### Import Model Deployment Modules

```python
from ads.model.deployment import (
    ModelDeployment,
    ModelDeploymentContainerRuntime,
    ModelDeploymentInfrastructure,
    ModelDeploymentMode,
)
```

## Setup Model Deployment Infrastructure

```python
# Path of the vLLM image pushed to the Oracle Cloud Container Registry
container_image = "<region>.ocir.io/<tenancy>/vllm-odsc/vllm-openai:gptoss"
```

```python
infrastructure = (
    ModelDeploymentInfrastructure()
    .with_project_id(project_id)
    .with_compartment_id(compartment_id)
    .with_shape_name(instance_shape)
    .with_bandwidth_mbps(10)
    .with_replica(1)
    .with_web_concurrency(1)
    .with_access_log(
        log_group_id=log_group_id,
        log_id=log_id,
    )
    .with_predict_log(
        log_group_id=log_group_id,
        log_id=log_id,
    )
)
```

## Configure Model Deployment Runtime

```python
env_var = {
    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
    "SHM_SIZE": "10g",
    "TIKTOKEN_RS_CACHE_DIR": "/opt/ds/model/gpt-oss-120b",
}

cmd_var = [
    "--model",
    "/opt/ds/model/deployed_model/gpt-oss-120b",
    "--tensor-parallel-size",
    "8",
    "--port",
    "8080",
    "--served-model-name",
    "openai/gpt-oss-120b",
    "--host",
    "0.0.0.0",
    "--trust-remote-code",
    "--quantization",
    "mxfp4",
]

container_runtime = (
    ModelDeploymentContainerRuntime()
    .with_image(container_image)
    .with_server_port(8080)
    .with_health_check_port(8080)
    .with_env(env_var)
    .with_cmd(cmd_var)
    .with_deployment_mode(ModelDeploymentMode.HTTPS)
    .with_model_uri(model.id)
    .with_region(region)
)
```

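For context, the values in `cmd_var` are passed as arguments to the container's entrypoint. Assuming the stock `vllm/vllm-openai` image's default entrypoint (`python3 -m vllm.entrypoints.openai.api_server`), the deployed container effectively runs something like the following once the model artifact is mounted:

```shell
# Approximate command the BYOC container runs (sketch; you do not run this manually)
python3 -m vllm.entrypoints.openai.api_server \
    --model /opt/ds/model/deployed_model/gpt-oss-120b \
    --tensor-parallel-size 8 \
    --port 8080 \
    --served-model-name openai/gpt-oss-120b \
    --host 0.0.0.0 \
    --trust-remote-code \
    --quantization mxfp4
```
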
## Deploy Model using Container Runtime

```python
deployment = (
    ModelDeployment()
    .with_display_name("gpt-oss-120b MD with BYOC")
    .with_description("Deployment of gpt-oss-120b MD with vLLM BYOC container")
    .with_infrastructure(infrastructure)
    .with_runtime(container_runtime)
).deploy(wait_for_completion=False)
```

```python
deployment.watch()
```

## Inference

```python
import requests
from string import Template
from datetime import datetime

auth = ads.common.auth.default_signer()["signer"]
prompt = "What amateur radio bands are best to use when there are solar flares? Keep your response to 100 words"
endpoint = f"https://modeldeployment.us-ashburn-1.oci.customer-oci.com/{deployment.model_deployment_id}/predict"

current_date = datetime.now().strftime("%d %B %Y")

body = {
    "model": "openai/gpt-oss-120b",  # this is a constant
    "messages": [
        {
            "role": "user",
            "content": prompt,
        }
    ],
}

requests.post(endpoint, json=body, auth=auth, headers={}).json()
```

#### Output:
During solar flares the ionospheric D‑layer becomes heavily ionized, causing severe absorption of lower HF (3–10 MHz). The most reliable amateur bands are therefore the higher HF bands that are less affected—particularly 15 m (21 MHz), 12 m (24 MHz), 10 m (28 MHz) and the VHF/UHF “line‑of‑sight” bands (50 MHz, 70 MHz, 144 MHz, 432 MHz) which can still work via sporadic E or auroral propagation. If you must use lower HF, stick to the 20 m (14 MHz) band during the flare’s peak, as it often remains usable. Keep power modest and monitor real‑time solar flux indices.
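
If you want to smoke-test the endpoint without Python, the OCI CLI can send a signed request directly. This is a sketch; replace `<model-deployment-ocid>` with your model deployment OCID and adjust the region to match your endpoint:

```shell
# Send a signed POST request to the /predict endpoint (hypothetical OCID placeholder)
oci raw-request \
    --http-method POST \
    --target-uri "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/<model-deployment-ocid>/predict" \
    --request-body '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}]}'
```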

LLM/deployment/container/Dockerfile

Lines changed: 2 additions & 2 deletions

```diff
@@ -19,9 +19,9 @@ WORKDIR /home/$USERNAME
 RUN /opt/conda/bin/conda create -n conda_env python=3.10 pip -y
 SHELL ["/opt/conda/bin/conda", "run", "-n", "conda_env", "/bin/bash", "-c"]

-ADD requirements.txt /opt/requirements.txt
+COPY requirements.txt /opt/requirements.txt
 RUN pip install -r /opt/requirements.txt && pip cache purge
-ADD app.py /opt/app.py
+COPY app.py /opt/app.py

 ENV MODEL_DIR="/opt/ds/model/deployed_model"
 ENV PATH=/home/$USERNAME/.conda/envs/conda_env/bin:/opt/conda/bin/:$PATH
```

model-deployment/A2A_agents_on_MD/agent_a/Dockerfile

Lines changed: 0 additions & 1 deletion

```diff
@@ -14,5 +14,4 @@ RUN uv sync
 EXPOSE 9999

 # Run the app
-# CMD ["python", "-m", "__main__"]
 CMD ["uv", "run", "."]
```

model-deployment/A2A_agents_on_MD/agent_a/__main__.py

Lines changed: 0 additions & 97 deletions

```diff
@@ -151,103 +151,6 @@ async def __call__(self, scope, receive, send):
         scope["path"] = "/"
         await self.app(scope, receive, send)

-# Test Client Function (for standalone testing)
-async def test_client_main() -> None:
-    PUBLIC_AGENT_CARD_PATH = '/.well-known/agent.json'
-    EXTENDED_AGENT_CARD_PATH = '/agent/authenticatedExtendedCard'
-    logging.basicConfig(level=logging.INFO)
-    logger = logging.getLogger(__name__)
-
-    base_url = 'https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.tenancy.oc1..aaaaaaaafwgqzxcwlkkpl5i334qpv62s375upsw2j4ufgcizfnnhjd4l55ia/agent-a/predict'
-    async with httpx.AsyncClient(auth=get_auth(), verify=False, headers={"Content-Length": "0"}) as httpx_client:
-        resolver = A2ACardResolver(
-            httpx_client=httpx_client,
-            base_url=base_url,
-        )
-        final_agent_card_to_use: AgentCard | None = None
-        try:
-            logger.info(
-                f'Attempting to fetch public agent card from: {base_url}{PUBLIC_AGENT_CARD_PATH}'
-            )
-            _public_card = (
-                await resolver.get_agent_card()
-            )
-            logger.info('Successfully fetched public agent card:')
-            logger.info(
-                _public_card.model_dump_json(indent=2, exclude_none=True)
-            )
-            final_agent_card_to_use = _public_card
-            logger.info(
-                '\nUsing PUBLIC agent card for client initialization (default).'
-            )
-            if _public_card.supportsAuthenticatedExtendedCard:
-                try:
-                    logger.info(
-                        f'\nPublic card supports authenticated extended card. Attempting to fetch from: {base_url}{EXTENDED_AGENT_CARD_PATH}'
-                    )
-                    auth_headers_dict = {
-                        'Authorization': 'Bearer dummy-token-for-extended-card'
-                    }
-                    _extended_card = await resolver.get_agent_card(
-                        relative_card_path=EXTENDED_AGENT_CARD_PATH,
-                        http_kwargs={'headers': auth_headers_dict},
-                    )
-                    logger.info(
-                        'Successfully fetched authenticated extended agent card:'
-                    )
-                    logger.info(
-                        _extended_card.model_dump_json(
-                            indent=2, exclude_none=True
-                        )
-                    )
-                    final_agent_card_to_use = (
-                        _extended_card
-                    )
-                    logger.info(
-                        '\nUsing AUTHENTICATED EXTENDED agent card for client initialization.'
-                    )
-                except Exception as e_extended:
-                    logger.warning(
-                        f'Failed to fetch extended agent card: {e_extended}. Will proceed with public card.',
-                        exc_info=True,
-                    )
-            elif (
-                _public_card
-            ):
-                logger.info(
-                    '\nPublic card does not indicate support for an extended card. Using public card.'
-                )
-        except Exception as e:
-            logger.error(
-                f'Critical error fetching public agent card: {e}', exc_info=True
-            )
-            raise RuntimeError(
-                'Failed to fetch the public agent card. Cannot continue.'
-            ) from e
-        client = A2AClient(
-            httpx_client=httpx_client, agent_card=final_agent_card_to_use
-        )
-        logger.info('A2AClient initialized.')
-        send_message_payload: dict[str, Any] = {
-            'message': {
-                'role': 'user',
-                'parts': [
-                    {'kind': 'text', 'text': 'how much is 10 USD in INR?'}
-                ],
-                'messageId': uuid4().hex,
-            },
-        }
-        request = SendMessageRequest(
-            id=str(uuid4()), params=MessageSendParams(**send_message_payload)
-        )
-        response = await client.send_message(request)
-        print(response.model_dump(mode='json', exclude_none=True))
-        streaming_request = SendStreamingMessageRequest(
-            id=str(uuid4()), params=MessageSendParams(**send_message_payload)
-        )
-        stream_response = client.send_message_streaming(streaming_request)
-        async for chunk in stream_response:
-            print(chunk.model_dump(mode='json', exclude_none=True))

 # Main Application
 if __name__ == '__main__':
```

model-deployment/A2A_agents_on_MD/agent_a/test_client.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -52,7 +52,7 @@ async def main() -> None:
     logging.basicConfig(level=logging.INFO)
     logger = logging.getLogger(__name__)

-    base_url = 'https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.tenancy.oc1..aaaaaaaafwgqzxcwlkkpl5i334qpv62s375upsw2j4ufgcizfnnhjd4l55ia/agent-a/predict'
+    base_url = '<add your custom url mdocid url here taken from the invoke your model screen in model deployment console>'
     async with httpx.AsyncClient(auth=get_auth(), verify=False, headers={"Content-Length": "0"}) as httpx_client:
         resolver = A2ACardResolver(
             httpx_client=httpx_client,
```
