
Commit 919547d

Update documentation from main repository
1 parent c8106fc commit 919547d


9 files changed (+187, -320 lines)


docs/api/cli.md

Lines changed: 148 additions & 281 deletions
Large diffs are not rendered by default.

docs/api/intro.md

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ sidebar_position: 1
 
 Welcome to the ServerlessLLM API documentation. This section contains detailed information about the various APIs provided by ServerlessLLM:
 
-- [CLI API](./cli.md) - Documentation for the `sllm-cli` command-line interface
+- [CLI API](./cli.md) - Documentation for the `sllm` command-line interface

docs/stable/deployment/multi_machine.md

Lines changed: 6 additions & 6 deletions
@@ -186,22 +186,22 @@ This output confirms that both the head node and worker node are properly connec
 3. Ensure each worker has its own `RAY_NODE_IP` set correctly
 :::
 
-### Step 3: Use `sllm-cli` to manage models
+### Step 3: Use `sllm` to manage models
 
 #### Configure the Environment
 
-**On any machine with `sllm-cli` installed, set the `LLM_SERVER_URL` environment variable:**
+**On any machine with `sllm` installed, set the `LLM_SERVER_URL` environment variable:**
 
 > Replace `<HEAD_NODE_IP>` with the actual IP address of the head node.
 
 ```bash
 export LLM_SERVER_URL=http://<HEAD_NODE_IP>:8343
 ```
 
-#### Deploy a Model Using `sllm-cli`
+#### Deploy a Model Using `sllm`
 
 ```bash
-sllm-cli deploy --model facebook/opt-1.3b
+sllm deploy --model facebook/opt-1.3b
 ```
 
 > Note: This command will spend some time downloading the model from the Hugging Face Model Hub. You can use any model from the [Hugging Face Model Hub](https://huggingface.co/models) by specifying the model name in the `--model` argument.
@@ -238,12 +238,12 @@ Expected output:
 {"id":"chatcmpl-23d3c0e5-70a0-4771-acaf-bcb2851c6ea6","object":"chat.completion","created":1721706121,"model":"facebook/opt-1.3b","choices":[{"index":0,"message":{"role":"assistant","content":"system: You are a helpful assistant.\nuser: What is your name?\nsystem: I am a helpful assistant.\n"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"completion_tokens":26,"total_tokens":42}}
 ```
 
-#### Delete a Deployed Model Using `sllm-cli`
+#### Delete a Deployed Model Using `sllm`
 
 When you're done using a model, you can delete it:
 
 ```bash
-sllm-cli delete facebook/opt-1.3b
+sllm delete facebook/opt-1.3b
 ```
 
 This will remove the specified model from the ServerlessLLM server.
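The "Expected output" above is the response to an OpenAI-style chat-completion request against the deployed model. A minimal sketch of such a request, assuming the `LLM_SERVER_URL` exported earlier and the `/v1/chat/completions` route used elsewhere in these docs (the request body is illustrative, not taken from this diff):

```bash
# Query the deployed model through the OpenAI-compatible endpoint.
curl "$LLM_SERVER_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "facebook/opt-1.3b",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is your name?"}
        ]
      }'
```

Any other OpenAI-compatible client pointed at the same base URL should behave the same way.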

docs/stable/deployment/single_machine.md

Lines changed: 7 additions & 7 deletions
@@ -158,18 +158,18 @@ I20241231 17:13:25.557906 2165054 checkpoint_store.cpp:83] Memory pool created w
 INFO 12-31 17:13:25 server.py:243] Starting gRPC server on 0.0.0.0:8073
 ```
 
-### 3. Start ServerlessLLM Serve
+### 3. Start ServerlessLLM
+
+Now, start the ServerlessLLM service process using `sllm start`.
 
-Now, start the ServerlessLLM Serve process ( `sllm-serve`).
 
 Open a new terminal and run:
 
 ```bash
-conda activate sllm
-sllm-serve start
+sllm start
 ```
 
-At this point, you should have four terminals open: one for the Ray head node, one for the Ray worker node, one for the ServerlessLLM Store server, and one for ServerlessLLM Serve.
+At this point, you should have four terminals open: one for the Ray head node, one for the Ray worker node, one for the ServerlessLLM Store server, and one for the ServerlessLLM service (started via `sllm start`).
 
 ### 4. Deploy a Model
 
@@ -179,7 +179,7 @@ Open a new terminal and run:
 
 ```bash
 conda activate sllm
-sllm-cli deploy --model facebook/opt-1.3b
+sllm deploy --model facebook/opt-1.3b
 ```
 
 This command downloads the specified model from Hugging Face Hub. To load a model from a local path, you can use a `config.json` file. Refer to the [CLI API documentation](../../api/cli.md#example-configuration-file-configjson) for details.
@@ -211,7 +211,7 @@ Expected output:
 To delete a deployed model, use the following command:
 
 ```bash
-sllm-cli delete facebook/opt-1.3b
+sllm delete facebook/opt-1.3b
 ```
 
 This command removes the specified model from the ServerlessLLM server.
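For orientation, the four terminals referred to above map roughly to the following commands; the flags are illustrative placeholders (the full single-machine guide lists the exact port, storage, and Ray resource options):

```bash
# Terminal 1: Ray head node (port shown is a common default, not prescriptive)
ray start --head --port=6379

# Terminal 2: Ray worker node, joining the head
ray start --address=127.0.0.1:6379

# Terminal 3: ServerlessLLM Store server
# (the "Starting gRPC server on 0.0.0.0:8073" log above comes from this process)
sllm-store start

# Terminal 4: ServerlessLLM service
conda activate sllm
sllm start
```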

docs/stable/deployment/slurm_cluster.md

Lines changed: 9 additions & 9 deletions
@@ -67,15 +67,15 @@ compute up 2 down infinite JobNode[16-17]
 Only one JobNode is enough.
 
 **`sbatch` Node Selection**
+Let's start a head on the main job node (`JobNode01`) and add the worker on other job node (`JobNode02`). The head and the worker should be on different job nodes to avoid resource contention. The `sllm-store` should be started on the job node that runs worker (`JobNode02`), for passing the model weights, and the `sllm start` should be started on the main job node (`JobNode01`), finally you can use `sllm` to manage the models on the login node.
 
-Let's start a head on the main job node (`JobNode01`) and add the worker on other job node (`JobNode02`). The head and the worker should be on different job nodes to avoid resource contention. The `sllm-store` should be started on the job node that runs worker (`JobNode02`), for passing the model weights, and the `sllm-serve` should be started on the main job node (`JobNode01`), finally you can use `sllm-cli` to manage the models on the login node.
 
 Note: `JobNode02` requires GPU, but `JobNode01` does not.
 - **Head**: JobNode01
 - **Worker**: JobNode02
 - **sllm-store**: JobNode02
 - **sllm-serve**: JobNode01
-- **sllm-cli**: Login Node
+- **sllm**: Login Node
 
 ---
 ## SRUN
@@ -167,7 +167,7 @@ In the 5th window, let's deploy a model to the ServerlessLLM server. You can dep
 ```shell
 source /opt/conda/bin/activate
 conda activate sllm
-sllm-cli deploy --model facebook/opt-1.3b --backend transformers
+sllm deploy --model facebook/opt-1.3b --backend transformers
 ```
 This will download the model from HuggingFace transformers. After deploying, you can query the model by any OpenAI API client. For example, you can use the following Python code to query the model:
 ```shell
@@ -189,7 +189,7 @@ Expected output:
 ### Step 5: Clean up
 To delete a deployed model, use the following command:
 ```shell
-sllm-cli delete facebook/opt-1.3b
+sllm delete facebook/opt-1.3b
 ```
 This will remove the specified model from the ServerlessLLM server.
 
@@ -350,8 +350,8 @@ We will start the worker node and store in the same script. Because the server l
 
 conda activate sllm
 
-sllm-serve start --host <HEAD_NODE_IP>
-# sllm-serve start --host <HEAD_NODE_IP> --port <avail_port> # if you have changed the port
+sllm start --host <HEAD_NODE_IP>
+# sllm start --host <HEAD_NODE_IP> --port <avail_port> # if you have changed the port
 ```
 - Replace `your_partition` in the script as before.
 - Replace `/path/to/ServerlessLLM` as before.
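The fragment above is the tail of the sbatch script submitted to the head job node. A rough sketch of the kind of wrapper it sits in; the `#SBATCH` directive values are placeholders, not the repository's actual script:

```bash
#!/bin/bash
#SBATCH --partition=your_partition   # replace as described above
#SBATCH --nodelist=JobNode01         # head job node; no GPU needed here
#SBATCH --job-name=sllm-head
#SBATCH --output=sllm-head.%j.log

source /opt/conda/bin/activate
conda activate sllm

# Start the ServerlessLLM service, bound to the head node's IP
sllm start --host <HEAD_NODE_IP>
```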
@@ -369,17 +369,17 @@ We will start the worker node and store in the same script. Because the server l
 INFO: Application startup complete.
 INFO: Uvicorn running on http://xxx.xxx.xx.xx:8343 (Press CTRL+C to quit)
 ```
-### Step 4: Use sllm-cli to manage models
+### Step 4: Use sllm to manage models
 1. **You can do this step on login node, and set the ```LLM_SERVER_URL``` environment variable:**
 ```shell
 $ conda activate sllm
 (sllm)$ export LLM_SERVER_URL=http://<HEAD_NODE_IP>:8343
 ```
 - Replace `<HEAD_NODE_IP>` with the actual IP address of the head node.
 - Replace ```8343``` with the actual port number (`<avail_port>` in Step1) if you have changed it.
-2. **Deploy a Model Using ```sllm-cli```**
+2. **Deploy a Model Using ```sllm```**
 ```shell
-(sllm)$ sllm-cli deploy --model facebook/opt-1.3b
+(sllm)$ sllm deploy --model facebook/opt-1.3b
 ```
 ### Step 5: Query the Model Using OpenAI API Client
 **You can use the following command to query the model:**

docs/stable/features/live_migration.md

Lines changed: 4 additions & 4 deletions
@@ -72,8 +72,8 @@ export LLM_SERVER_URL=http://127.0.0.1:8343
 
 Deploy the models:
 ```bash
-sllm-cli deploy --config config-qwen-1.5b.json
-sllm-cli deploy --config config-qwen-3b.json
+sllm deploy --config config-qwen-1.5b.json
+sllm deploy --config config-qwen-3b.json
 ```
 
 3. **Verify the Deployment**
@@ -139,8 +139,8 @@ export LLM_SERVER_URL=http://127.0.0.1:8343
 Deploy the models:
 
 ```bash
-sllm-cli deploy --config config-qwen-1.5b.json
-sllm-cli deploy --config config-qwen-3b.json
+sllm deploy --config config-qwen-1.5b.json
+sllm deploy --config config-qwen-3b.json
 ```
 
 3. **Verify the Deployment**

docs/stable/features/peft_lora_serving.md

Lines changed: 4 additions & 4 deletions
@@ -71,7 +71,7 @@ export LLM_SERVER_URL=http://127.0.0.1:8343
 ```
 2. Deploy models with specified lora adapters.
 ```bash
-sllm-cli deploy --model facebook/opt-125m --backend transformers --enable-lora --lora-adapters demo_lora1=peft-internal-testing/opt-125m-dummy-lora demo_lora2=monsterapi/opt125M_alpaca
+sllm deploy --model facebook/opt-125m --backend transformers --enable-lora --lora-adapters "demo_lora1=peft-internal-testing/opt-125m-dummy-lora demo_lora2=monsterapi/opt125M_alpaca"
 ```
 3. Verify the deployment.
 ```bash
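The added line quotes the adapter list, so the space-separated `<adapter-name>=<HuggingFace-repo>` pairs reach the CLI as a single argument (this reading of the quoting is an inference from the diff, not stated in it). Written across lines for readability:

```bash
# Adapter mapping passed as one quoted string of <name>=<repo> pairs
sllm deploy --model facebook/opt-125m --backend transformers \
  --enable-lora \
  --lora-adapters "demo_lora1=peft-internal-testing/opt-125m-dummy-lora demo_lora2=monsterapi/opt125M_alpaca"
```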
@@ -99,16 +99,16 @@ curl $LLM_SERVER_URL/v1/chat/completions \
 }'
 ```
 ### Step 5: Update LoRA Adapters
-If you wish to switch to a different set of LoRA adapters, you can still use `sllm-cli deploy` command with updated adapter configurations. ServerlessLLM will automatically reload the new adapters without restarting the backend.
+If you wish to switch to a different set of LoRA adapters, you can still use `sllm deploy` command with updated adapter configurations. ServerlessLLM will automatically reload the new adapters without restarting the backend.
 ```bash
-sllm-cli deploy --model facebook/opt-125m --backend transformers --enable-lora --lora-adapters demo-lora1=edbeeching/opt-125m-lora demo-lora2=Hagatiana/opt-125m-lora
+sllm deploy --model facebook/opt-125m --backend transformers --enable-lora --lora-adapters "demo-lora1=edbeeching/opt-125m-lora demo-lora2=Hagatiana/opt-125m-lora"
 ```
 
 ### Step 6: Clean Up
 
 Delete the lora adapters by running the following command (this command will only delete lora adapters, the base model won't be deleted):
 ```bash
-sllm-cli delete facebook/opt-125m --lora-adapters demo-lora1 demo-lora2
+sllm delete facebook/opt-125m --lora-adapters "demo-lora1 demo-lora2"
 ```
 If you need to stop and remove the containers, you can use the following commands:
 ```bash

docs/stable/features/storage_aware_scheduling.md

Lines changed: 3 additions & 3 deletions
@@ -70,8 +70,8 @@ In the `examples/storage_aware_scheduling` directory, the example configuration
 conda activate sllm
 export LLM_SERVER_URL=http://127.0.0.1:8343
 
-sllm-cli deploy --config config-opt-2.7b.json
-sllm-cli deploy --config config-opt-1.3b.json
+sllm deploy --config config-opt-2.7b.json
+sllm deploy --config config-opt-1.3b.json
 ```
 
 3. Verify the deployment.
@@ -112,7 +112,7 @@ As shown in the log message, the model "facebook/opt-2.7b" is scheduled on serve
 Delete the model deployment by running the following command:
 
 ```bash
-sllm-cli delete facebook/opt-1.3b facebook/opt-2.7b
+sllm delete facebook/opt-1.3b facebook/opt-2.7b
 ```
 
 If you need to stop and remove the containers, you can use the following commands:

docs/stable/getting_started.md

Lines changed: 5 additions & 5 deletions
@@ -4,7 +4,7 @@ sidebar_position: 1
 
 # Getting Started
 
-This guide demonstrates how to quickly set up a local ServerlessLLM cluster using Docker Compose on a single machine. We will initialize a minimal cluster, consisting of a head node and a single worker node. Then, we'll deploy a model using the `sllm-cli` and query the deployment through an OpenAI-compatible API.
+This guide demonstrates how to quickly set up a local ServerlessLLM cluster using Docker Compose on a single machine. We will initialize a minimal cluster, consisting of a head node and a single worker node. Then, we'll deploy a model using the `sllm` and query the deployment through an OpenAI-compatible API.
 
 :::note
 We strongly recommend using Docker (Compose) to manage your ServerlessLLM cluster, whether you are using ServerlessLLM for testing or development. However, if Docker is not a viable option for you, please refer to the [deploy from scratch guide](./deployment/single_machine.md).
@@ -77,18 +77,18 @@ INFO: Uvicorn running on http://0.0.0.0:8343 (Press CTRL+C to quit)
 (FcfsScheduler pid=1604) INFO 05-26 15:40:49 fcfs_scheduler.py:111] Starting control loop
 ```
 
-## Deploy a Model Using sllm-cli
+## Deploy a Model Using sllm
 
 Set the `LLM_SERVER_URL` environment variable:
 
 ```bash
 export LLM_SERVER_URL=http://127.0.0.1:8343
 ```
 
-Deploy a model to the ServerlessLLM cluster using the `sllm-cli`:
+Deploy a model to the ServerlessLLM cluster using the `sllm`:
 
 ```bash
-sllm-cli deploy --model facebook/opt-1.3b
+sllm deploy --model facebook/opt-1.3b
 ```
 > Note: This command will take some time to download the model from the Hugging Face Model Hub.
 > You can use any model from the [Hugging Face Model Hub](https://huggingface.co/models) by specifying its name in the `--model` argument.
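As the note says, any Hugging Face Hub model ID can be passed to `--model`; for example (the model below is only an illustration, not one used elsewhere in this guide):

```bash
# Deploy a different Hugging Face model by its hub ID
sllm deploy --model Qwen/Qwen2.5-0.5B-Instruct
```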
@@ -125,7 +125,7 @@ Expected output:
 To delete a deployed model, execute the following command:
 
 ```bash
-sllm-cli delete facebook/opt-1.3b
+sllm delete facebook/opt-1.3b
 ```
 
 This command removes the specified model from the ServerlessLLM server.
