---
sidebar_position: 1
---

# Installation with ROCm (Experimental)

## Build the wheel from source and install

ServerlessLLM Store (`sllm-store`) currently provides experimental support for the ROCm platform. Due to an internal bug in ROCm, `sllm-store` may leak GPU memory on ROCm versions earlier than 6.2.0, as noted in [this issue](https://github.com/ROCm/HIP/issues/3580).

Currently, `pip install .` does not work with ROCm. We suggest you build the `sllm-store` wheel and install it manually in your environment.

To build `sllm-store` from source, we suggest using Docker and building inside a ROCm container.

1. Clone the repository and enter the `sllm_store` directory:

```bash
git clone git@github.com:ServerlessLLM/ServerlessLLM.git
cd ServerlessLLM/sllm_store
```

2. Build the Docker image from `Dockerfile.rocm`. The `Dockerfile.rocm` is built on top of the `rocm/pytorch:rocm6.2_ubuntu22.04_py3.10_pytorch_release_2.3.0` image.

```bash
docker build -t sllm_store_rocm -f Dockerfile.rocm .
```

3. Build the package inside the ROCm Docker container:

```bash
docker run -it --rm -v $(pwd)/dist:/app/dist sllm_store_rocm /bin/bash
rm -rf /app/dist/* # remove the existing built files
python setup.py sdist bdist_wheel
```

4. Install PyTorch and the package in your local environment (a quick sanity check follows this list):

```bash
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
pip install dist/*.whl
```
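
After installing the wheel, a quick import check confirms that the package is picked up from your environment. This is a minimal sketch and assumes the wheel ships the `sllm_store` Python package used by the example scripts below.

```python
# Minimal sanity check: verify that the freshly installed wheel is importable
# and show where it was installed (assumes the wheel ships the `sllm_store` package).
import sllm_store

print(sllm_store.__file__)
```
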

## Verify the Installation

### End-to-end tests

#### Transformer model Loading and Inference

1. Save the `facebook/opt-1.3b` model in the `./models` directory:

```bash
python3 examples/sllm_store/save_transformers_model.py --model_name facebook/opt-1.3b --storage_path ./models
```

2. Start the `sllm-store` server:

```bash
sllm-store-server
```

3. Load the model and run inference in another terminal (a sketch of what the script does follows the expected output):

```bash
python3 examples/sllm_store/load_transformers_model.py --model_name facebook/opt-1.3b --storage_path ./models
```

Expected output:

```bash
DEBUG 10-31 10:43:14 transformers.py:178] load_dict_non_blocking takes 0.008747100830078125 seconds
DEBUG 10-31 10:43:14 transformers.py:189] load config takes 0.0016036033630371094 seconds
DEBUG 10-31 10:43:14 torch.py:137] allocate_cuda_memory takes 0.0041697025299072266 seconds
DEBUG 10-31 10:43:14 client.py:72] load_into_gpu: facebook/opt-1.3b, 544e032d-9080-429f-bbc0-cdbc2a298060
INFO 10-31 10:43:14 client.py:113] Model loaded: facebook/opt-1.3b, 544e032d-9080-429f-bbc0-cdbc2a298060
INFO 10-31 10:43:14 torch.py:160] restore state_dict takes 0.0017423629760742188 seconds
DEBUG 10-31 10:43:14 transformers.py:199] load model takes 0.17534756660461426 seconds
INFO 10-31 10:43:14 client.py:117] confirm_model_loaded: facebook/opt-1.3b, 544e032d-9080-429f-bbc0-cdbc2a298060
INFO 10-31 10:43:14 client.py:125] Model loaded
Model loading time: 0.20s
~/miniconda3/envs/sllm/lib/python3.10/site-packages/transformers/generation/utils.py:1249: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
Hello, my dog is cute and I want to give him a good home. I have a
```
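
For reference, the core of `load_transformers_model.py` looks roughly like the sketch below. This is a minimal illustration built on the `sllm_store.transformers.load_model` API; the exact argument names (`device_map`, `torch_dtype`, `storage_path`) are assumptions and may differ from the shipped example script.

```python
# Minimal sketch: load a checkpoint through sllm-store and run a short generation.
# Argument names are assumptions and may differ from the shipped example script.
import time

import torch
from transformers import AutoTokenizer

from sllm_store.transformers import load_model

start = time.time()
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models",
)
print(f"Model loading time: {time.time() - start:.2f}s")

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
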

#### vLLM model Loading and Inference
:::tip
Directly installing vLLM v0.5.0.post1 may not work with ROCm 6.2.0. This is due to an ambiguous function call in ROCm 6.2.0. You may modify vLLM's source code as in this [commit](https://github.com/vllm-project/vllm/commit/9984605412de1171a72d955cfcb954725edd4d6f).

As with CUDA, you need to apply our patch `sllm_store/vllm_patch/sllm_load.patch` to the installed vLLM library:
```bash
./sllm_store/vllm_patch/patch.sh
```
:::

1. Save the `facebook/opt-1.3b` model in the `./models` directory:

```bash
python3 examples/sllm_store/save_vllm_model.py --model_name facebook/opt-1.3b --storage_path ./models
```

2. Start the `sllm-store` server:

```bash
sllm-store-server
```

3. Load the model and run inference in another terminal (a sketch of the vLLM usage follows the expected output):

```bash
python3 examples/sllm_store/load_vllm_model.py --model_name facebook/opt-1.3b --storage_path ./models
```

Expected output:

```bash
INFO 10-31 11:05:16 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='./models/facebook/opt-1.3b', speculative_config=None, tokenizer='./models/facebook/opt-1.3b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.SERVERLESS_LLM, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./models/facebook/opt-1.3b)
INFO 10-31 11:05:17 selector.py:56] Using ROCmFlashAttention backend.
INFO 10-31 11:05:17 selector.py:56] Using ROCmFlashAttention backend.
DEBUG 10-31 11:05:17 torch.py:137] allocate_cuda_memory takes 0.0005428791046142578 seconds
DEBUG 10-31 11:05:17 client.py:72] load_into_gpu: facebook/opt-1.3b/rank_0, 9d7c0425-f652-4c4c-b1c5-fb6df0aab0a8
INFO 10-31 11:05:17 client.py:113] Model loaded: facebook/opt-1.3b/rank_0, 9d7c0425-f652-4c4c-b1c5-fb6df0aab0a8
INFO 10-31 11:05:17 torch.py:160] restore state_dict takes 0.0013034343719482422 seconds
INFO 10-31 11:05:17 client.py:117] confirm_model_loaded: facebook/opt-1.3b/rank_0, 9d7c0425-f652-4c4c-b1c5-fb6df0aab0a8
INFO 10-31 11:05:17 client.py:125] Model loaded
INFO 10-31 11:05:17 model_runner.py:160] Loading model weights took 0.0000 GB
INFO 10-31 11:05:25 gpu_executor.py:83] # GPU blocks: 18509, # CPU blocks: 1365
INFO 10-31 11:05:26 model_runner.py:903] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-31 11:05:26 model_runner.py:907] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 10-31 11:05:31 model_runner.py:979] Graph capturing finished in 6 secs.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 12.13it/s, est. speed input: 78.83 toks/s, output: 194.04 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Joel, and I have been working as a web designer/developer for the'
Prompt: 'The president of the United States is', Generated text: " speaking in an increasingly important national security forum and he's not using the right words"
Prompt: 'The capital of France is', Generated text: " Paris.\nYeah but you couldn't get it through a French newspaper!"
Prompt: 'The future of AI is', Generated text: ' literally in your hands\nDespite all the hype, AI isn’t here'
```
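
Under the hood, the example drives a standard vLLM `LLM` engine whose weights are loaded through `sllm-store`. The sketch below is a minimal illustration; the `load_format="serverless_llm"` value is an assumption inferred from the `LoadFormat.SERVERLESS_LLM` entry in the log above and may differ from the shipped example script.

```python
# Minimal sketch: run vLLM with weights served by sllm-store.
# Assumes the patched vLLM accepts load_format="serverless_llm"
# (matching LoadFormat.SERVERLESS_LLM in the log above).
from vllm import LLM, SamplingParams

llm = LLM(
    model="./models/facebook/opt-1.3b",
    load_format="serverless_llm",  # added by the sllm_store vLLM patch
    dtype="float16",
)

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.8, top_p=0.95))
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```
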
### Python tests

1. Install the test dependencies:

```bash
cd ServerlessLLM
pip install -r requirements-test.txt
```

2. Run the tests:

```bash
cd ServerlessLLM/sllm_store/tests/python
pytest
```

### C++ tests

1. Build the C++ tests:

```bash
cd ServerlessLLM/sllm_store
bash build.sh
```

2. Run the tests:

```bash
cd ServerlessLLM/sllm_store/build
ctest --output-on-failure
```

## Tested Hardware
+ OS: Ubuntu 22.04
+ ROCm: 6.2
+ PyTorch: 2.3.0
+ GPU: MI100s (gfx908), MI200s (gfx90a)

## Known issues

1. GPU memory leak in ROCm before version 6.2.0

   This issue is due to an internal bug in ROCm: after an inference instance completes, the GPU memory it used remains occupied and is not released. For more information, please refer to [this issue](https://github.com/ROCm/HIP/issues/3580). To check which ROCm version your PyTorch build targets, see the sketch after this list.

2. vLLM v0.5.0.post1 cannot be built with ROCm 6.2.0

   This issue is due to an ambiguous function call in ROCm 6.2.0. You may modify vLLM's source code as in this [commit](https://github.com/vllm-project/vllm/commit/9984605412de1171a72d955cfcb954725edd4d6f).
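
A quick way to confirm whether your environment is affected by the memory-leak issue is to check the ROCm/HIP version your PyTorch build targets. This is a minimal sketch and assumes a ROCm build of PyTorch.

```python
# Print the HIP/ROCm version this PyTorch build was compiled against.
# On a ROCm build of PyTorch this is a version string; on a CUDA build,
# torch.version.hip is None.
import torch

print(torch.version.hip)
```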
