docs/source/getting-started: 1 file changed, +8 -2 lines

@@ -40,6 +40,8 @@ You can use our official offline example script to run offline inference as follows
```bash
cd examples/
+ # Change the model path to your own model path
+ export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
python offline_inference.py
```
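For reference, a quick sanity check before launching can catch a mistyped path; the sketch below is illustrative and assumes `offline_inference.py` reads the `MODEL_PATH` environment variable set above:

```bash
# Illustrative pre-flight check: confirm MODEL_PATH points at a local model directory
export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct   # replace with your own path
if [ ! -d "${MODEL_PATH}" ]; then
  echo "MODEL_PATH is not a directory: ${MODEL_PATH}" >&2
  exit 1
fi
python offline_inference.py
```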
@@ -58,7 +60,10 @@ export PYTHONHASHSEED=123456
Run the following command to start the vLLM server with the Qwen/Qwen2.5-14B-Instruct model:

```bash
- vllm serve /home/models/Qwen2.5-14B-Instruct \
+ # Change the model path to your own model path
+ export MODEL_PATH=/home/models/Qwen2.5-14B-Instruct
+ vllm serve ${MODEL_PATH} \
+ --served-model-name vllm_cpu_offload \
--max-model-len 20000 \
--tensor-parallel-size 2 \
--gpu_memory_utilization 0.87 \
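Once the server reports it is ready, the OpenAI-compatible model list endpoint can be used to confirm that the alias set by `--served-model-name` was registered; a minimal check, assuming the server listens on port 7800 as in the curl example below:

```bash
# The response should list "vllm_cpu_offload" among the served models
curl http://localhost:7800/v1/models
```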
@@ -95,11 +100,12 @@ After successfully starting the vLLM server, you can interact with the API as follows
curl http://localhost:7800/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
-   "model": "/home/models/Qwen2.5-14B-Instruct",
+   "model": "vllm_cpu_offload",
    "prompt": "Shanghai is a",
    "max_tokens": 7,
    "temperature": 0
  }'
```
</details>
+ Note: If you want to disable the vLLM prefix cache to test UCM's caching ability, you can add `--no-enable-prefix-caching` to the command line.
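As a sketch of where that flag goes (showing only the options from the hunk above; the remaining options of the full command carry over unchanged):

```bash
# Illustrative: the serve command from above with vLLM's built-in prefix caching disabled,
# so that observed cache hits come from UCM rather than from vLLM itself
vllm serve ${MODEL_PATH} \
  --served-model-name vllm_cpu_offload \
  --no-enable-prefix-caching
```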