diff --git a/ai-quick-actions/model-deployment-tips.md b/ai-quick-actions/model-deployment-tips.md
index 90547d3e..05a6b875 100644
--- a/ai-quick-actions/model-deployment-tips.md
+++ b/ai-quick-actions/model-deployment-tips.md
@@ -9,6 +9,8 @@ Table of Contents:
 - [Model Evaluation](evaluation-tips.md)
 - [Model Registration](register-tips.md)
 - [Multi Modal Inferencing](multimodal-models-tips.md)
+- [Multi Model Inferencing](multimodel-deployment-tips.md)
+- [Stacked Model Inferencing](stacked-deployment-tips.md)
 - [Private_Endpoints](model-deployment-private-endpoint-tips.md)
 - [Tool Calling](model-deployment-tool-calling-tips.md)
@@ -918,4 +920,4 @@ Table of Contents:
 - [Model Registration](register-tips.md)
 - [Multi Modal Inferencing](multimodal-models-tips.md)
 - [Private_Endpoints](model-deployment-private-endpoint-tips.md)
-- [Tool Calling](model-deployment-tool-calling-tips.md)
\ No newline at end of file
+- [Tool Calling](model-deployment-tool-calling-tips.md)
diff --git a/ai-quick-actions/multimodel-deployment-tips.md b/ai-quick-actions/multimodel-deployment-tips.md
index 183afc03..38e415ad 100644
--- a/ai-quick-actions/multimodel-deployment-tips.md
+++ b/ai-quick-actions/multimodel-deployment-tips.md
@@ -63,6 +63,8 @@ For fine-tuned models, requests specifying the base model name (ex. model: meta-
   - [CLI Output](#cli-output-3)
   - [Create Multi-Model (1 Embedding Model, 1 LLM) deployment with `/v1/completions`](#create-multi-model-1-embedding-model-1-llm-deployment-with-v1completions)
 - [Manage Multi-Model Deployments](#manage-multi-model-deployments)
+  - [List Multi-Model Deployments](#list-multi-model-deployments)
+  - [Edit Multi-Model Deployments](#edit-multi-model-deployments)
 - [Multi-Model Inferencing](#multi-model-inferencing)
   - [Using oci-cli](#using-oci-cli)
   - [Using Python SDK (without streaming)](#using-python-sdk-without-streaming)
@@ -101,16 +103,22 @@ Only Multi-Model Deployments with **base service LLM models (text-generation)**

 ### Select 'Deploy Multi Model'

 - Based on the 'models' field, a Compute Shape will be recommended to accommodate both models.
+- Select the 'Fine Tuned Weights'.
+  - Only fine-tuned models with version `V2` can be deployed as weights in a Multi-Model Deployment. To deploy an older fine-tuned model weight, run the following command to convert it to version `V2`, then use the new fine-tuned model when creating the deployment. By default this command deletes the old fine-tuned model after conversion; add ``--delete_model False`` to keep it instead.
+
+    ```bash
+    ads aqua model convert_fine_tune --model_id [FT_OCID]
+    ```
 - Select logging and endpoints (/v1/completions | /v1/chat/completions).
 - Submit form via 'Deploy Button' at bottom.

-![mmd-form](web_assets/deploy-mmd.png)
+![mmd-form](web_assets/deploy-multi.png)

 ### Inferencing with Multi-Model Deployment

 There are two ways to send inference requests to models within a Multi-Model Deployment

 1. Python SDK (recommended) - see [here](#Multi-Model-Inferencing)
-2. Using AQUA UI (see below, ok for testing)
+2. Using AQUA UI - see [here](#using-aqua-ui-interface-for-multi-model-deployment)

 Once the Deployment is Active, view the model deployment details and inferencing form by clicking on the 'Deployments' Tab and selecting the model within the Model Deployment list.

@@ -472,8 +480,13 @@ ads aqua deployment get_multimodel_deployment_config --model_ids '["ocid1.datasc

 ## 3. Create Multi-Model Deployment

-Only **base service LLM models** are supported for MultiModel Deployment.
-All selected models will run on the same **GPU shape**, sharing the available compute resources. Make sure to choose a shape that meets the needs of all models in your deployment using [MultiModel Configuration command](#get-multimodel-configuration)
+All selected models will run on the same **GPU shape**, sharing the available compute resources. Make sure to choose a shape that meets the needs of all models in your deployment using the [MultiModel Configuration command](#get-multimodel-configuration).
+
+Only fine-tuned models with version `V2` can be deployed as weights in a Multi-Model Deployment. To deploy an older fine-tuned model weight, run the following command to convert it to version `V2`, then use the new fine-tuned model OCID when creating the deployment. By default this command deletes the old fine-tuned model after conversion; add ``--delete_model False`` to keep it instead.
+
+```bash
+ads aqua model convert_fine_tune --model_id [FT_OCID]
+```

 ### Description

@@ -750,6 +763,144 @@ To list all AQUA deployments (both Multi-Model and single-model) within a specif

 Note: Multi-Model deployments are identified by the tag `"aqua_multimodel": "true",` associated with them.

+### Edit Multi-Model Deployments
+
+An AQUA deployment must be in the `ACTIVE` state to be updated, and only one of the following option groups can be updated at a time. There are two ways to update a model deployment: `ZDT` (zero-downtime) and `LIVE`. The default update type for an AQUA deployment is `ZDT`, but `LIVE` is adopted if `models` are changed in a multi-model deployment.
+
+  - `Name or description`: Change the name or description.
+  - `Default configuration`: Change or add freeform and defined tags.
+  - `Models`: Change the models.
+  - `Compute`: Change the number of OCPUs or the amount of memory in gigabytes.
+  - `Logging`: Change the logging configuration for access and predict logs.
+  - `Load Balancer`: Change the load balancing bandwidth.
+
+#### Usage
+
+```bash
+ads aqua deployment update [OPTIONS]
+```
+
+#### Required Parameters
+
+`--model_deployment_id [str]`
+
+The model deployment OCID to be updated.
+
+#### Optional Parameters
+
+`--models [str]`
+
+The string representation of a JSON array, where each object defines a model's OCID and the number of GPUs assigned to it. The GPU count should always be a **power of two (e.g., 1, 2, 4, 8)**.
+Example: `'[{"model_id":"", "gpu_count":1},{"model_id":"", "gpu_count":1}]'` for `VM.GPU.A10.2` shape.
+
+`--display_name [str]`
+
+The name of the model deployment.
+
+`--description [str]`
+
+The description of the model deployment. Defaults to None.
+
+`--instance_count [int]`
+
+The number of instances used for the model deployment. Defaults to 1.
+
+`--log_group_id [str]`
+
+The OCI logging group OCID. The access log and predict log share the same log group.
+
+`--access_log_id [str]`
+
+The access log OCID for the access logs. Check [model deployment logging](https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_using_logging.htm) for more details.
+
+`--predict_log_id [str]`
+
+The predict log OCID for the predict logs. Check [model deployment logging](https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_using_logging.htm) for more details.
+
+`--web_concurrency [int]`
+
+The number of worker processes/threads to handle incoming requests.
+
+`--bandwidth_mbps [int]`
+
+The bandwidth limit on the load balancer in Mbps.
+
+`--memory_in_gbs [float]`
+
+Memory (in GB) for the selected shape.
+
+`--ocpus [float]`
+
+OCPU count for the selected shape.
+
+`--freeform_tags [dict]`
+
+Freeform tags for the model deployment.
+
+`--defined_tags [dict]`
+
+Defined tags for the model deployment.
+
+#### Example
+
+##### Edit Multi-Model deployment with `/v1/completions`
+
+```bash
+ads aqua deployment update \
+  --model_deployment_id "ocid1.datasciencemodeldeployment.oc1.iad." \
+  --models '[{"model_id":"ocid1.datasciencemodel.oc1.iad.", "model_name":"test_updated_model_name", "gpu_count":2}]' \
+  --display_name "updated_modelDeployment_multmodel_model1_model2"
+```
+
+##### CLI Output
+
+```json
+{
+  "id": "ocid1.datasciencemodeldeployment.oc1.iad.",
+  "display_name": "updated_modelDeployment_multmodel_model1_model2",
+  "aqua_service_model": false,
+  "model_id": "ocid1.datasciencemodelgroup.oc1.iad.",
+  "models": [
+    {
+      "model_id": "ocid1.datasciencemodel.oc1.iad.",
+      "model_name": "mistralai/Mistral-7B-v0.1",
+      "gpu_count": 1,
+      "env_var": {}
+    },
+    {
+      "model_id": "ocid1.datasciencemodel.oc1.iad.",
+      "model_name": "tiiuae/falcon-7b",
+      "gpu_count": 1,
+      "env_var": {}
+    }
+  ],
+  "aqua_model_name": "",
+  "state": "UPDATING",
+  "description": null,
+  "created_on": "2025-03-10 19:09:40.793000+00:00",
+  "created_by": "ocid1.user.oc1..",
+  "endpoint": "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "private_endpoint_id": null,
+  "console_link": "https://cloud.oracle.com/data-science/model-deployments/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "lifecycle_details": null,
+  "shape_info": {
+    "instance_shape": "VM.GPU.A10.2",
+    "instance_count": 1,
+    "ocpus": null,
+    "memory_in_gbs": null
+  },
+  "tags": {
+    "aqua_model_id": "ocid1.datasciencemodelgroup.oc1.",
+    "aqua_multimodel": "true",
+    "OCI_AQUA": "active"
+  },
+  "environment_variables": {
+    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions",
+    "MODEL_DEPLOY_ENABLE_STREAMING": "true"
+  }
+}
+```

 # Multi-Model Inferencing

 The only change required to infer a specific model from a Multi-Model deployment is to update the value of the `"model"` parameter in the request payload. The values for this parameter can be found in the Model Deployment details, under the field name `"model_name"`. This parameter segregates the request flow, ensuring that the inference request is directed to the correct model within the MultiModel deployment.
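+To make this concrete, below is a minimal sketch using the `oci raw-request` pattern from the inferencing sections that follow. It assumes the two-model deployment from the CLI output above (model names `mistralai/Mistral-7B-v0.1` and `tiiuae/falcon-7b`); the deployment endpoint is a placeholder, `--auth security_token` assumes session-token authentication, and only the `"model"` value differs between the two requests.
+
+```bash
+# Query the first model in the Multi-Model deployment
+oci raw-request \
+  --http-method POST \
+  --target-uri <deployment-endpoint>/predict \
+  --request-body '{
+    "model": "mistralai/Mistral-7B-v0.1",
+    "prompt": "what are activation functions?",
+    "max_tokens": 250
+  }' \
+  --auth security_token
+
+# Same deployment, same payload; only the "model" value changes
+oci raw-request \
+  --http-method POST \
+  --target-uri <deployment-endpoint>/predict \
+  --request-body '{
+    "model": "tiiuae/falcon-7b",
+    "prompt": "what are activation functions?",
+    "max_tokens": 250
+  }' \
+  --auth security_token
+```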
diff --git a/ai-quick-actions/stacked-deployment-tips.md b/ai-quick-actions/stacked-deployment-tips.md
new file mode 100644
index 00000000..0e13ccee
--- /dev/null
+++ b/ai-quick-actions/stacked-deployment-tips.md
@@ -0,0 +1,837 @@
+# **AI Quick Actions Stacked Deployment**
+
+# Table of Contents
+- [Introduction to Stacked Deployment and Serving](#introduction-to-stacked-deployment-and-serving)
+- [Models](#models)
+  - [Fine Tuned Models](#fine-tuned-models)
+- [Stacked Deployment](#stacked-deployment)
+  - [Create Stacked Deployment via AQUA UI](#create-stacked-deployment-via-aqua-ui)
+  - [Create Stacked Deployment via ADS CLI](#create-stacked-deployment-via-ads-cli)
+  - [Manage Stacked Deployments](#manage-stacked-deployments)
+    - [List Stacked Deployments](#list-stacked-deployments)
+    - [Edit Stacked Deployments](#edit-stacked-deployments)
+- [Stacked Model Inferencing](#stacked-model-inferencing)
+- [Stacked Model Evaluation](#stacked-model-evaluation)
+  - [Create Model Evaluations](#create-model-evaluations)
+
+# Introduction to Stacked Deployment and Serving
+
+Stacked Model Deployment enables deploying a base model alongside multiple fine-tuned weights within the same deployment. During inference, responses can be generated using either the base model or the associated fine-tuned weights, depending on the request. The Data Science service provides a prebuilt **vLLM service container** that makes deploying and serving stacked large language models easy, simplifying the deployment process and reducing operational complexity. This container comes with **vLLM's native routing**, which routes requests to the appropriate model, ensuring seamless prediction.
+
+This document describes how to create stacked deployments using AI Quick Actions (AQUA) model deployments, and how to evaluate the models.
+
+# Models
+
+The first step in the process is to get the OCIDs of the desired base service LLM AQUA models, which are required to initiate the stacked deployment process. Refer to [AQUA CLI tips](cli-tips.md) for detailed instructions on how to obtain the OCIDs of base service LLM AQUA models.
+
+You can also obtain the OCID from the AQUA user interface by clicking on the model card and selecting the `Copy OCID` button from the `More Options` dropdown in the top-right corner of the screen.
+
+## Fine Tuned Models
+
+Only fine-tuned models with version `V2` can be deployed as weights in a Stacked Deployment. To deploy an older fine-tuned model weight, run the following command to convert it to version `V2`, then use the new fine-tuned model OCID when creating the deployment. By default this command deletes the old fine-tuned model after conversion; add ``--delete_model False`` to keep it instead.
+
+```bash
+ads aqua model convert_fine_tune --model_id [FT_OCID]
+```
+
+If a `V2` fine-tuned model is deployed as a single-model deployment, AQUA will fetch its base model, attach the fine-tuned model as a weight, and deploy them as a stacked deployment instead.
+
+# Stacked Deployment
+
+## Create Stacked Deployment via AQUA UI
+
+### Create Stack Deployment
+
+Open the AQUA UI and navigate to the `Deployments` tab. Click `Create Deployment` in the upper right and you should see the following page. Select `Deploy Model Stack`, then select the service model and its corresponding fine-tuned weights. You can customize the inference keys for each service and fine-tuned model.
+
+![Deploy Model](web_assets/deploy-stack.png)
+
+### Compute Shape
+
+The compute shape selection is critical; the available list is filtered to shapes suitable for the chosen model.
+
+- VM.GPU.A10.1 has 24GB of GPU memory and 240GB of CPU memory. The limiting factor is usually the GPU memory, which needs to be big enough to hold the model.
+- VM.GPU.A10.2 has 48GB of GPU memory.
+- BM.GPU.A10.4 has 96GB of GPU memory and runs on a bare metal machine, rather than a VM.
+
+For a full list of shapes and their definitions, see the [compute shape docs](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm).
+
+The relationship between model parameter size and GPU memory is roughly 2x the parameter count in GB (two bytes per parameter at 16-bit precision), so for example a model with 7B parameters will need a minimum of 14 GB for inference. At runtime the memory is used both to hold the weights and to hold the concurrent contexts for users' requests.
+
+### Advanced Options
+
+You may click "Show Advanced Options" to configure options for the inference container.
+
+![Advanced Options](web_assets/deploy-stack-model-advanced-options.png)
+
+### Inference Container Configuration
+
+The service allows the model deployment configuration to be overridden when creating a model deployment. Depending on the type of inference container used for deployment, i.e. vLLM or TGI, the parameters vary and need to be passed in the format `(--param-name, param-value)`.
+
+For more details, please visit the [vLLM](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server) documentation to learn more about the parameters accepted by the respective containers.
+
+## Create Stacked Deployment via ADS CLI
+
+### Description
+
+You'll need the latest version of ADS to create a new AQUA stacked deployment. Installation instructions are available [here](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/cli/quickstart.html).
+
+### Usage
+
+```bash
+ads aqua deployment create [OPTIONS]
+```
+
+### Required Parameters
+
+`--models [str]`
+
+The string representation of a JSON array, where each object defines a model OCID, a model name, and its associated fine-tuned weights. The model names are used to reference specific models during inference requests and support a [maximum length of 32 characters](https://docs.oracle.com/en-us/iaas/Content/data-science/using/models-mms-top.htm#models-mms-key-concepts). The model OCID is used for inferencing if no model name is provided. Only **one** base model is allowed when creating a stacked deployment.
+Example: `'[{"model_id":"", "model_name":"", "fine_tune_weights": [{"model_id": "", "model_name":""},{"model_id":"", "model_name": ""}]}]'` for `VM.GPU.A10.2` shape.
+
+`--instance_shape [str]`
+
+The shape (GPU) of the instance used for the model deployment.
+Example: `VM.GPU.A10.2, BM.GPU.A10.4, BM.GPU4.8, BM.GPU.A100-v2.8`.
+
+`--display_name [str]`
+
+The name of the model deployment.
+
+`--container_image_uri [str]`
+
+The URI of the inference container associated with the model being deployed. For stacked deployments, the value is a vLLM container URI.
+Example: `dsmc://odsc-vllm-serving:0.6.4.post1.2` or `dsmc://odsc-vllm-serving:0.8.1.2`
+
+`--deployment_type [str]`
+
+The deployment type for creating the model deployment. For stacked deployments, the value must be `STACKED`. Failing to provide `--deployment_type` will result in creating a multi-model deployment instead.
+
+### Optional Parameters
+
+`--compartment_id [str]`
+
+The compartment OCID where the model deployment is to be created. If not provided, it defaults to the user's compartment.
+
+`--project_id [str]`
+
+The project OCID where the model deployment is to be created. If not provided, it defaults to the user's project.
+
+`--description [str]`
+
+The description of the model deployment. Defaults to None.
+
+`--instance_count [int]`
+
+The number of instances used for the model deployment. Defaults to 1.
+
+`--log_group_id [str]`
+
+The OCI logging group OCID. The access log and predict log share the same log group.
+
+`--access_log_id [str]`
+
+The access log OCID for the access logs. Check [model deployment logging](https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_using_logging.htm) for more details.
+
+`--predict_log_id [str]`
+
+The predict log OCID for the predict logs. Check [model deployment logging](https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_using_logging.htm) for more details.
+
+`--web_concurrency [int]`
+
+The number of worker processes/threads to handle incoming requests.
+
+`--server_port [int]`
+
+The server port for the Docker container image. Defaults to 8080.
+
+`--health_check_port [int]`
+
+The health check port for the Docker container image. Defaults to 8080.
+
+`--env_var [dict]`
+
+Environment variables for the model deployment. Defaults to None.
+
+`--private_endpoint_id [str]`
+
+The private endpoint OCID of the model deployment.
+
+### Example
+
+#### Create Stacked deployment with `/v1/completions`
+
+```bash
+ads aqua deployment create \
+  --container_image_uri "dsmc://odsc-vllm-serving:0.6.4.post1.2" \
+  --models '[{"model_id":"ocid1.datasciencemodel.oc1.iad.", "model_name":"test_model_name", "fine_tune_weights": [{"model_id": "ocid1.datasciencemodel.oc1.iad.", "model_name":"test_ft_name_one"},{"model_id":"ocid1.datasciencemodel.oc1.iad.", "model_name": "test_ft_name_two"}]}]' \
+  --instance_shape "VM.GPU.A10.1" \
+  --display_name "modelDeployment_stacked_model" \
+  --deployment_type "STACKED"
+```
+
+##### CLI Output
+
+```json
+{
+  "id": "ocid1.datasciencemodeldeployment.oc1.iad.",
+  "display_name": "modelDeployment_stacked_model",
+  "aqua_service_model": false,
+  "model_id": "ocid1.datasciencemodelgroup.oc1.iad.",
+  "models": [],
+  "aqua_model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "state": "CREATING",
+  "description": null,
+  "created_on": "2025-10-13 17:48:53.416000+00:00",
+  "created_by": "ocid1.user.oc1..",
+  "endpoint": "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "private_endpoint_id": null,
+  "console_link": "https://cloud.oracle.com/data-science/model-deployments/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "lifecycle_details": null,
+  "shape_info": {
+    "instance_shape": "VM.GPU.A10.1",
+    "instance_count": 1,
+    "ocpus": null,
+    "memory_in_gbs": null
+  },
+  "tags": {
+    "task": "text_generation",
+    "aqua_model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "OCI_AQUA": "active"
+  },
+  "environment_variables": {
+    "BASE_MODEL": "service_models/Meta-Llama-3.1-8B-Instruct/5206a32/artifact",
+    "VLLM_ALLOW_RUNTIME_LORA_UPDATING": "true",
+    "MODEL": "/opt/ds/model/deployed_model/ocid1.datasciencemodel.oc1.iad./",
+    "PARAMS": "--served-model-name test_model_name --disable-custom-all-reduce --seed 42 --max-model-len 4096 --max-lora-rank 32 --enable_lora",
+    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/completions",
+    "MODEL_DEPLOY_ENABLE_STREAMING": "true",
+    "PORT": "8080",
+    "HEALTH_CHECK_PORT": "8080",
+    "AQUA_TELEMETRY_BUCKET_NS": "ociodscdev",
+    "AQUA_TELEMETRY_BUCKET": "service-managed-models"
+  },
+  "cmd": []
+}
+```
+
+#### Create Stacked deployment with `/v1/chat/completions`
+
+```bash
+ads aqua deployment create \
+  --container_image_uri "dsmc://odsc-vllm-serving:0.6.4.post1.2" \
+  --models '[{"model_id":"ocid1.datasciencemodel.oc1.iad.", "model_name":"test_model_name", "fine_tune_weights": [{"model_id": "ocid1.datasciencemodel.oc1.iad.", "model_name":"test_ft_name_one"},{"model_id":"ocid1.datasciencemodel.oc1.iad.", "model_name": "test_ft_name_two"}]}]' \
+  --env-var '{"MODEL_DEPLOY_PREDICT_ENDPOINT":"/v1/chat/completions"}' \
+  --instance_shape "VM.GPU.A10.1" \
+  --display_name "modelDeployment_stacked_model" \
+  --deployment_type "STACKED"
+```
+
+##### CLI Output
+
+```json
+{
+  "id": "ocid1.datasciencemodeldeployment.oc1.iad.",
+  "display_name": "modelDeployment_stacked_model",
+  "aqua_service_model": false,
+  "model_id": "ocid1.datasciencemodelgroup.oc1.iad.",
+  "models": [],
+  "aqua_model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "state": "CREATING",
+  "description": null,
+  "created_on": "2025-10-13 17:48:53.416000+00:00",
+  "created_by": "ocid1.user.oc1..",
+  "endpoint": "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "private_endpoint_id": null,
"https://cloud.oracle.com/data-science/model-deployments/ocid1.datasciencemodeldeployment.oc1.iad.", + "lifecycle_details": null, + "shape_info": { + "instance_shape": "VM.GPU.A10.1", + "instance_count": 1, + "ocpus": null, + "memory_in_gbs": null + }, + "tags": { + "task": "text_generation", + "aqua_model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct", + "OCI_AQUA": "active" + }, + "environment_variables": { + "BASE_MODEL": "service_models/Meta-Llama-3.1-8B-Instruct/5206a32/artifact", + "VLLM_ALLOW_RUNTIME_LORA_UPDATING": "true", + "MODEL": "/opt/ds/model/deployed_model/ocid1.datasciencemodel.oc1.iad./", + "PARAMS": "--served-model-name test_model_name --disable-custom-all-reduce --seed 42 --max-model-len 4096 --max-lora-rank 32 --enable_lora", + "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/chat/completions", + "MODEL_DEPLOY_ENABLE_STREAMING": "true", + "PORT": "8080", + "HEALTH_CHECK_PORT": "8080", + "AQUA_TELEMETRY_BUCKET_NS": "ociodscdev", + "AQUA_TELEMETRY_BUCKET": "service-managed-models" + }, + "cmd": [] +} +``` + +## Manage Stacked Deployments + +### List Stacked Deployments + +To list all AQUA deployments (all Stacked, MultiModel and single-model) within a specified compartment or project, or to get detailed information on a specific Stacked deployment, kindly refer to the [AQUA CLI tips](cli-tips.md) documentation. + +Note: Stacked deployments are identified by the tag `"aqua_stacked_model": "true",` associated with them. + +### Edit Stacked Deployments + +AQUA deployment must be in `ACTIVE` state to be updated and can only be updated one at a time for the following option groups. There are two ways to update model deployment: `ZDT` and `LIVE`. The default update type for AQUA deployment is `ZDT` but `LIVE` will be adopted if `models` are changed in stacked deployment. + + - `Name or description`: Change the name or description. + - `Default configuration`: Change or add freeform and defined tags. + - `Models`: Change the model. + - `Compute`: Change the number of CPUs or amount of memory for each CPU in gigabytes. + - `Logging`: Change the logging configuration for access and predict logs. + - `Load Balancer`: Change the load balancing bandwidth. + +#### Usage + +```bash +ads aqua deployment update [OPTIONS] +``` + +#### Required Parameters + +`--model_deployment_id [str]` + +The model deployment OCID to be updated. + +#### Optional Parameters + +`--models [str]` + +The String representation of a JSON array, where each object defines a model OCID, model name and its associating fine tuned weights. The model names are used to reference specific models during inference requests and support a [maximum length of 32 characters](https://docs.oracle.com/en-us/iaas/Content/data-science/using/models-mms-top.htm#models-mms-key-concepts). Only **one** base model is allowed for updating stacked deployment
+Example: `'[{"model_id":"", "model_name":"", "fine_tune_weights": [{"model_id": "", "model_name":""},{"model_id":"", "model_name": ""}]}]'` for `VM.GPU.A10.2` shape.
+
+`--display_name [str]`
+
+The name of the model deployment.
+
+`--description [str]`
+
+The description of the model deployment. Defaults to None.
+
+`--instance_count [int]`
+
+The number of instances used for the model deployment. Defaults to 1.
+
+`--log_group_id [str]`
+
+The OCI logging group OCID. The access log and predict log share the same log group.
+
+`--access_log_id [str]`
+
+The access log OCID for the access logs. Check [model deployment logging](https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_using_logging.htm) for more details.
+
+`--predict_log_id [str]`
+
+The predict log OCID for the predict logs. Check [model deployment logging](https://docs.oracle.com/en-us/iaas/data-science/using/model_dep_using_logging.htm) for more details.
+
+`--web_concurrency [int]`
+
+The number of worker processes/threads to handle incoming requests.
+
+`--bandwidth_mbps [int]`
+
+The bandwidth limit on the load balancer in Mbps.
+
+`--memory_in_gbs [float]`
+
+Memory (in GB) for the selected shape.
+
+`--ocpus [float]`
+
+OCPU count for the selected shape.
+
+`--freeform_tags [dict]`
+
+Freeform tags for the model deployment.
+
+`--defined_tags [dict]`
+
+Defined tags for the model deployment.
+
+#### Example
+
+##### Edit Stacked deployment with `/v1/completions`
+
+```bash
+ads aqua deployment update \
+  --model_deployment_id "ocid1.datasciencemodeldeployment.oc1.iad." \
+  --models '[{"model_id":"ocid1.datasciencemodel.oc1.iad.", "model_name":"test_updated_model_name"}]' \
+  --display_name "updated_modelDeployment_stacked_model"
+```
+
+##### CLI Output
+
+```json
+{
+  "id": "ocid1.datasciencemodeldeployment.oc1.iad.",
+  "display_name": "updated_modelDeployment_stacked_model",
+  "aqua_service_model": false,
+  "model_id": "ocid1.datasciencemodelgroup.oc1.iad.",
+  "models": [],
+  "aqua_model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+  "state": "UPDATING",
+  "description": null,
+  "created_on": "2025-10-13 17:48:53.416000+00:00",
+  "created_by": "ocid1.user.oc1..",
+  "endpoint": "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "private_endpoint_id": null,
+  "console_link": "https://cloud.oracle.com/data-science/model-deployments/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "lifecycle_details": null,
+  "shape_info": {
+    "instance_shape": "VM.GPU.A10.1",
+    "instance_count": 1,
+    "ocpus": null,
+    "memory_in_gbs": null
+  },
+  "tags": {
+    "task": "text_generation",
+    "aqua_model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+    "OCI_AQUA": "active"
+  },
+  "environment_variables": {
+    "BASE_MODEL": "service_models/Meta-Llama-3.1-8B-Instruct/5206a32/artifact",
+    "VLLM_ALLOW_RUNTIME_LORA_UPDATING": "true",
+    "MODEL": "/opt/ds/model/deployed_model/ocid1.datasciencemodel.oc1.iad./",
+    "PARAMS": "--served-model-name test_updated_model_name --disable-custom-all-reduce --seed 42 --max-model-len 4096 --max-lora-rank 32 --enable_lora",
+    "MODEL_DEPLOY_PREDICT_ENDPOINT": "/v1/completions",
+    "MODEL_DEPLOY_ENABLE_STREAMING": "true",
+    "PORT": "8080",
+    "HEALTH_CHECK_PORT": "8080",
+    "AQUA_TELEMETRY_BUCKET_NS": "ociodscdev",
+    "AQUA_TELEMETRY_BUCKET": "service-managed-models"
+  },
+  "cmd": []
+}
+```
+
+# Stacked Model Inferencing
+
+The only change required to infer a specific model from a Stacked deployment is to update the value of the `"model"` parameter in the request payload. The values for this parameter can be found in the Model Deployment details, under the field name `"model_name"`. This parameter segregates the request flow, ensuring that the inference request is directed to the correct model within the Stacked deployment.
+
+## Using AQUA UI
+
+![Inferencing](web_assets/try-stack-model.png)
+
+## Using oci-cli
+
+```bash
+oci raw-request \
+  --http-method POST \
+  --target-uri /predict \
+  --request-body '{
+    "model": "",
+    "prompt": "what are activation functions?",
+    "max_tokens": 250,
+    "temperature": 0.7,
+    "top_p": 0.8
+  }' \
+  --auth 
+```
+
+Note: Currently `oci-cli` does not support streaming responses; use the Python or Java SDK instead.
+
+## Using Python SDK (without streaming)
+
+```python
+# The OCI SDK must be installed for this example to function properly.
+# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm
+
+import requests
+import oci
+from oci.signer import Signer
+from oci.config import from_file
+
+config = from_file('~/.oci/config')
+auth = Signer(
+    tenancy=config['tenancy'],
+    user=config['user'],
+    fingerprint=config['fingerprint'],
+    private_key_file_location=config['key_file'],
+    pass_phrase=config['pass_phrase']
+)
+
+# For security token based authentication
+# token_file = config['security_token_file']
+# token = None
+# with open(token_file, 'r') as f:
+#     token = f.read()
+# private_key = oci.signer.load_private_key_from_file(config['key_file'])
+# auth = oci.auth.signers.SecurityTokenSigner(token, private_key)
+
+model = ""
+
+endpoint = "https://modeldeployment.us-ashburn-1.oci.oc-test.com/ocid1.datasciencemodeldeployment.oc1.iad.xxxxxxxxx/predict"
+body = {
+    "model": model,  # set to the model_name of the target model
+    "prompt": "what are activation functions?",
+    "max_tokens": 250,
+    "temperature": 0.7,
+    "top_p": 0.8,
+}
+
+res = requests.post(endpoint, json=body, auth=auth, headers={}).json()
+
+print(res)
+```
+
+## Using Python SDK (with streaming)
+
+To consume streaming Server-sent Events (SSE), install [sseclient-py](https://pypi.org/project/sseclient-py/) using `pip install sseclient-py`.
+
+```python
+# The OCI SDK must be installed for this example to function properly.
+# Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/pythonsdk.htm
+
+import requests
+import oci
+from oci.signer import Signer
+from oci.config import from_file
+import sseclient  # pip install sseclient-py
+
+config = from_file('~/.oci/config')
+auth = Signer(
+    tenancy=config['tenancy'],
+    user=config['user'],
+    fingerprint=config['fingerprint'],
+    private_key_file_location=config['key_file'],
+    pass_phrase=config['pass_phrase']
+)
+
+# For security token based authentication
+# token_file = config['security_token_file']
+# token = None
+# with open(token_file, 'r') as f:
+#     token = f.read()
+# private_key = oci.signer.load_private_key_from_file(config['key_file'])
+# auth = oci.auth.signers.SecurityTokenSigner(token, private_key)
+
+model = ""
+
+endpoint = "https://modeldeployment.us-ashburn-1.oci.oc-test.com/ocid1.datasciencemodeldeployment.oc1.iad.xxxxxxxxx/predict"
+body = {
+    "model": model,  # set to the model_name of the target model
+    "prompt": "what are activation functions?",
+    "max_tokens": 250,
+    "temperature": 0.7,
+    "top_p": 0.8,
+    "stream": True,
+}
+
+headers = {'Content-Type': 'application/json', 'enable-streaming': 'true', 'Accept': 'text/event-stream'}
+response = requests.post(endpoint, json=body, auth=auth, stream=True, headers=headers)
+
+print(response.headers)
+
+client = sseclient.SSEClient(response)
+for event in client.events():
+    print(event.data)
+
+# Alternatively, we can use the below code to print the response.
+# for line in response.iter_lines():
+#     if line:
+#         print(line)
+```
+
+## Using Python SDK for /v1/chat/completions endpoint
+
+To access a model deployed with the `/v1/chat/completions` endpoint for inference, update the body and replace the `prompt` field with `messages`.
+
+```python
+...
+body = {
+    "model": "",  # set to the model_name of the target model
+    "messages": [{"role": "user", "content": [{"type": "text", "text": "Who wrote the book Harry Potter?"}]}],
+    "max_tokens": 250,
+    "temperature": 0.7,
+    "top_p": 0.8,
+}
+...
+```
+
+## Using Java (with streaming)
+
+```java
+/**
+ * The OCI SDK must be installed for this example to function properly.
+ * Installation instructions can be found here: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/javasdk.htm
+ */
+package org.example;
+
+import com.oracle.bmc.auth.AuthenticationDetailsProvider;
+import com.oracle.bmc.auth.SessionTokenAuthenticationDetailsProvider;
+import com.oracle.bmc.http.ClientConfigurator;
+import com.oracle.bmc.http.Priorities;
+import com.oracle.bmc.http.client.HttpClient;
+import com.oracle.bmc.http.client.HttpClientBuilder;
+import com.oracle.bmc.http.client.HttpRequest;
+import com.oracle.bmc.http.client.HttpResponse;
+import com.oracle.bmc.http.client.Method;
+import com.oracle.bmc.http.client.jersey.JerseyHttpProvider;
+import com.oracle.bmc.http.client.jersey.sse.SseSupport;
+import com.oracle.bmc.http.internal.ParamEncoder;
+import com.oracle.bmc.http.signing.RequestSigningFilter;
+
+import javax.ws.rs.core.MediaType;
+import java.io.BufferedReader;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.net.URI;
+import java.nio.charset.StandardCharsets;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Function;
+
+public class RestExample {
+
+    public static void main(String[] args) throws Exception {
+        String configurationFilePath = "~/.oci/config";
+        String profile = "DEFAULT";
+
+        // Pre-Requirement: Allow setting of restricted headers. This is required to allow the SigningFilter
+        // to set the host header that gets computed during signing of the request.
+        System.setProperty("sun.net.http.allowRestrictedHeaders", "true");
+
+        final AuthenticationDetailsProvider provider =
+                new SessionTokenAuthenticationDetailsProvider(configurationFilePath, profile);
+
+        // 1) Create a request signing filter instance using SessionTokenAuth Provider.
+        RequestSigningFilter requestSigningFilter = RequestSigningFilter.fromAuthProvider(provider);
+
+        // 1) Alternatively, RequestSigningFilter can be created from a config file.
+        // RequestSigningFilter requestSigningFilter = RequestSigningFilter.fromConfigFile(configurationFilePath, profile);
+
+        // 2) Create a Jersey client and register the request signing filter.
+        // Refer to this page https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/javasdkexamples.htm for information regarding the compatibility of the HTTP client(s) with OCI SDK version.
+
+        HttpClientBuilder builder = JerseyHttpProvider.getInstance()
+                .newBuilder()
+                .registerRequestInterceptor(Priorities.AUTHENTICATION, requestSigningFilter)
+                .baseUri(
+                        URI.create(
+                                "${modelDeployment.modelDeploymentUrl}/"
+                                        + ParamEncoder.encodePathParam("predict")));
+        // 3) Create a request and set the expected type header.
+
+        String jsonPayload = "{}"; // Add your payload here, following the request body examples shown above.
+
+        // 4) Setup Streaming request
+        Function<InputStream, List<String>> generateTextResultReader = getInputStreamListFunction();
+        SseSupport sseSupport = new SseSupport(generateTextResultReader);
+        ClientConfigurator clientConfigurator = sseSupport.getClientConfigurator();
+        clientConfigurator.customizeClient(builder);
+
+        try (HttpClient client = builder.build()) {
+            HttpRequest request = client
+                    .createRequest(Method.POST)
+                    .header("accepts", MediaType.APPLICATION_JSON)
+                    .header("content-type", MediaType.APPLICATION_JSON)
+                    .header("enable-streaming", "true")
+                    .body(jsonPayload);
+
+            // 5) Invoke the call and get the response.
+            HttpResponse response = request.execute().toCompletableFuture().get();
+
+            // 6) Print the response headers and body
+            Map<String, List<String>> responseHeaders = response.headers();
+            System.out.println("HTTP Headers " + responseHeaders);
+
+            InputStream responseBody = response.streamBody().toCompletableFuture().get();
+            try (
+                    final BufferedReader reader = new BufferedReader(
+                            new InputStreamReader(responseBody, StandardCharsets.UTF_8)
+                    )
+            ) {
+                String line;
+                while ((line = reader.readLine()) != null) {
+                    System.out.println(line);
+                }
+            }
+        } catch (Exception ex) {
+            throw ex;
+        }
+    }
+
+    private static Function<InputStream, List<String>> getInputStreamListFunction() {
+        Function<InputStream, List<String>> generateTextResultReader = entityStream -> {
+            try (BufferedReader reader =
+                         new BufferedReader(new InputStreamReader(entityStream))) {
+                String line;
+                List<String> generatedTextList = new ArrayList<>();
+                while ((line = reader.readLine()) != null) {
+                    if (line.isEmpty() || line.startsWith(":")) {
+                        continue;
+                    }
+                    generatedTextList.add(line);
+                }
+                return generatedTextList;
+            } catch (Exception ex) {
+                throw new RuntimeException(ex);
+            }
+        };
+        return generateTextResultReader;
+    }
+}
+```
+
+# Stacked Model Evaluation
+
+## Create Model Evaluations
+
+### Description
+
+Creates a new evaluation model using an existing AQUA stacked deployment. For a stacked deployment, evaluations must be created separately for each model, using the same model deployment OCID.
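+For example, to evaluate both the base model and one of its fine-tuned weights from the same stacked deployment, submit two create calls that differ only in the `"model"` value inside `--model_parameters`. This is a minimal sketch assuming the model names from the create examples above (`test_model_name`, `test_ft_name_one`); OCIDs and object storage paths are elided as in the full example below, where the complete parameter reference also follows.
+
+```bash
+# Evaluate the base model in the stacked deployment
+ads aqua evaluation create \
+  --evaluation_source_id "ocid1.datasciencemodeldeployment.oc1.iad." \
+  --evaluation_name "test_evaluation_base" \
+  --dataset_path "oci://@/path/to/the/dataset.jsonl" \
+  --report_path "oci://@/report/path/" \
+  --model_parameters '{"model":"test_model_name","max_tokens": 500}' \
+  --shape_name "VM.Standard.E4.Flex" \
+  --block_storage_size 50
+
+# Evaluate a fine-tuned weight: same deployment OCID, different "model" value
+ads aqua evaluation create \
+  --evaluation_source_id "ocid1.datasciencemodeldeployment.oc1.iad." \
+  --evaluation_name "test_evaluation_ft_one" \
+  --dataset_path "oci://@/path/to/the/dataset.jsonl" \
+  --report_path "oci://@/report/path/" \
+  --model_parameters '{"model":"test_ft_name_one","max_tokens": 500}' \
+  --shape_name "VM.Standard.E4.Flex" \
+  --block_storage_size 50
+```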
+
+### Usage
+
+```bash
+ads aqua evaluation create [OPTIONS]
+```
+
+### Required Parameters
+
+`--evaluation_source_id [str]`
+
+The evaluation source OCID. Must be a stacked deployment OCID.
+
+`--evaluation_name [str]`
+
+The name for the evaluation.
+
+`--dataset_path [str]`
+
+The dataset path for the evaluation. Must be an object storage path.
+Example: `oci://@/path/to/the/dataset.jsonl`
+
+`--report_path [str]`
+
+The report path for the evaluation. Must be an object storage path.
+Example: `oci://@/report/path/`
+
+`--model_parameters [str]`
+
+The parameters for the evaluation. For a stacked deployment, `"model"` is a required evaluation parameter.
+
+`--shape_name [str]`
+
+The shape name for the evaluation job infrastructure.
+Example: `VM.Standard.E3.Flex, VM.Standard.E4.Flex, VM.Standard3.Flex, VM.Optimized3.Flex`.
+
+`--block_storage_size [int]`
+
+The block storage size for the evaluation job infrastructure.
+
+### Optional Parameters
+
+`--compartment_id [str]`
+
+The compartment OCID where the evaluation is to be created. If not provided, it defaults to the user's compartment.
+
+`--project_id [str]`
+
+The project OCID where the evaluation is to be created. If not provided, it defaults to the user's project.
+
+`--evaluation_description [str]`
+
+The description of the evaluation. Defaults to None.
+
+`--memory_in_gbs [float]`
+
+The memory in GBs for the selected flexible shape.
+
+`--ocpus [float]`
+
+The OCPU count for the selected shape.
+
+`--experiment_id [str]`
+
+The evaluation model version set OCID. If provided, the evaluation model will be associated with it. Defaults to None.
+
+`--experiment_name [str]`
+
+The evaluation model version set name. If provided, the model version set with the same name will be used if it exists; otherwise, a new model version set will be created with that name.
+
+`--experiment_description [str]`
+
+The description for the evaluation model version set.
+
+`--log_group_id [str]`
+
+The log group OCID for the evaluation job infrastructure. Defaults to None.
+
+`--log_id [str]`
+
+The log OCID for the evaluation job infrastructure. Defaults to None.
+
+`--metrics [list]`
+
+The metrics for the evaluation; currently BERTScore and ROUGE are supported.
+Example: `'[{"name": "bertscore", "args": {}}, {"name": "rouge", "args": {}}]'`
+
+`--force_overwrite [bool]`
+
+A flag to indicate whether to force overwrite the existing evaluation file in object storage if already present. Defaults to `False`.
+
+### Example
+
+```bash
+ads aqua evaluation create \
+  --evaluation_source_id "ocid1.datasciencemodeldeployment.oc1.iad." \
+  --evaluation_name "test_evaluation" \
+  --dataset_path "oci://@/path/to/the/dataset.jsonl" \
+  --report_path "oci://@/report/path/" \
+  --model_parameters '{"model":"","max_tokens": 500, "temperature": 0.7, "top_p": 1.0, "top_k": 50}' \
+  --shape_name "VM.Standard.E4.Flex" \
+  --block_storage_size 50 \
+  --metrics '[{"name": "bertscore", "args": {}}, {"name": "rouge", "args": {}}]'
+```
+
+#### CLI Output
+
+```json
+{
+  "id": "ocid1.datasciencemodeldeployment.oc1.iad.",
+  "name": "test_evaluation",
+  "aqua_service_model": true,
+  "state": "CREATING",
+  "description": null,
+  "created_on": "2024-02-03 21:21:31.952000+00:00",
+  "created_by": "ocid1.user.oc1..",
+  "endpoint": "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1.iad.",
+  "console_link": "https://cloud.oracle.com/data-science/model-deployments/ocid1.datasciencemodeldeployment.oc1.iad.?region=us-ashburn-1",
+  "shape_info": {
+    "instance_shape": "VM.Standard.E4.Flex",
+    "instance_count": 1,
+    "ocpus": 1.0,
+    "memory_in_gbs": 16.0
+  },
+  "tags": {
+    "aqua_service_model": "ocid1.datasciencemodel.oc1.iad.#Mistral-7B-v0.1",
+    "OCI_AQUA": ""
+  }
+}
+```
+
+For other operations related to **Evaluation**, such as listing evaluations and retrieving evaluation details, please refer to [AQUA CLI tips](cli-tips.md).
diff --git a/ai-quick-actions/web_assets/deploy-multi-model-advanced-options.png b/ai-quick-actions/web_assets/deploy-multi-model-advanced-options.png
new file mode 100644
index 00000000..e7217ea7
Binary files /dev/null and b/ai-quick-actions/web_assets/deploy-multi-model-advanced-options.png differ
diff --git a/ai-quick-actions/web_assets/deploy-multi.png b/ai-quick-actions/web_assets/deploy-multi.png
new file mode 100644
index 00000000..3268dfd9
Binary files /dev/null and b/ai-quick-actions/web_assets/deploy-multi.png differ
diff --git a/ai-quick-actions/web_assets/deploy-stack-model-advanced-options.png b/ai-quick-actions/web_assets/deploy-stack-model-advanced-options.png
new file mode 100644
index 00000000..083bfab0
Binary files /dev/null and b/ai-quick-actions/web_assets/deploy-stack-model-advanced-options.png differ
diff --git a/ai-quick-actions/web_assets/deploy-stack.png b/ai-quick-actions/web_assets/deploy-stack.png
new file mode 100644
index 00000000..25093133
Binary files /dev/null and b/ai-quick-actions/web_assets/deploy-stack.png differ
diff --git a/ai-quick-actions/web_assets/try-multi-model.png b/ai-quick-actions/web_assets/try-multi-model.png
new file mode 100644
index 00000000..dd36f231
Binary files /dev/null and b/ai-quick-actions/web_assets/try-multi-model.png differ
diff --git a/ai-quick-actions/web_assets/try-stack-model.png b/ai-quick-actions/web_assets/try-stack-model.png
new file mode 100644
index 00000000..d5bff24a
Binary files /dev/null and b/ai-quick-actions/web_assets/try-stack-model.png differ