Commit 477dc28

[Docs] Improve the quick_start.md (#275)

* Improve the quick_start.md
* Add a Note
1 parent b271703 commit 477dc28

docs/source/getting-started/quick_start.md

Lines changed: 8 additions & 2 deletions
@@ -40,6 +40,8 @@ You can use our official offline example script to run offline inference as follows:
 
 ```bash
 cd examples/
+# Change the model path to your own model path
+export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
 python offline_inference.py
 ```
 
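This change assumes `offline_inference.py` reads `MODEL_PATH` from the environment. If that holds, the variable can also be scoped to a single run instead of being exported:

```bash
# One-off alternative to `export`, assuming offline_inference.py reads
# MODEL_PATH from the environment; the path is the one used in the docs above.
MODEL_PATH=/home/models/Qwen2.5-14B-Instruct python offline_inference.py
```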
@@ -58,7 +60,10 @@ export PYTHONHASHSEED=123456
 Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:
 
 ```bash
-vllm serve /home/models/Qwen2.5-14B-Instruct \
+# Change the model path to your own model path
+export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
+vllm serve ${MODEL_PATH} \
+--served-model-name vllm_cpu_offload \
 --max-model-len 20000 \
 --tensor-parallel-size 2 \
 --gpu_memory_utilization 0.87 \
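Once the server is up, the effect of `--served-model-name` can be checked through the OpenAI-compatible model list endpoint. This is a minimal sketch, assuming the server listens on port 7800 as in the completion example below:

```bash
# List the models exposed by the server; the response should report the
# served model name "vllm_cpu_offload" instead of the local filesystem path.
# Port 7800 is an assumption carried over from the curl example below.
curl http://localhost:7800/v1/models
```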
@@ -95,11 +100,12 @@ After successfully starting the vLLM server, you can interact with the API as follows:
 curl http://localhost:7800/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
-"model": "/home/models/Qwen2.5-14B-Instruct",
+"model": "vllm_cpu_offload",
 "prompt": "Shanghai is a",
 "max_tokens": 7,
 "temperature": 0
 }'
 ```
 </details>
 
+Note: If you want to disable the vLLM prefix cache to test the caching ability of UCM, you can add `--no-enable-prefix-caching` to the command line.
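As a rough sketch of where that flag goes, the serve command from above would look like the following. Only the arguments visible in this diff are repeated; the remaining arguments of the original command line (port, UCM connector settings, and so on) are truncated in the diff and still apply:

```bash
# Sketch only: the serve command with vLLM prefix caching disabled so that
# UCM's own caching can be exercised. Arguments not shown in the diff are omitted.
vllm serve ${MODEL_PATH} \
    --served-model-name vllm_cpu_offload \
    --max-model-len 20000 \
    --tensor-parallel-size 2 \
    --gpu_memory_utilization 0.87 \
    --no-enable-prefix-caching
```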
