
Commit c8106fc

committed
Update documentation from main repository
1 parent 0ba55e4 commit c8106fc

2 files changed: +277 -0 lines changed
Lines changed: 175 additions & 0 deletions
---
sidebar_position: 3
---

# Quantization

This example demonstrates how to use quantization within the ServerlessLLM framework to optimize model serving. Quantization reduces a large language model's memory footprint and computational requirements by representing its weights with lower-precision data types, such as 8-bit integers (int8). The steps below show how to deploy and serve a quantized model in a ServerlessLLM cluster.
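As a rough illustration of the savings, the back-of-the-envelope sketch below counts weight storage only (it ignores activations, the KV cache, and runtime overhead) for a model of roughly OPT-1.3B's size:

```python
# Approximate weight-memory estimate for a ~1.3B-parameter model (illustrative only).
NUM_PARAMS = 1.3e9  # approximate parameter count

def weight_memory_gib(bytes_per_param: float) -> float:
    """Return approximate weight storage in GiB at the given precision."""
    return NUM_PARAMS * bytes_per_param / (1024 ** 3)

print(f"fp16:  {weight_memory_gib(2.0):.1f} GiB")  # ~2.4 GiB
print(f"int8:  {weight_memory_gib(1.0):.1f} GiB")  # ~1.2 GiB
print(f"4-bit: {weight_memory_gib(0.5):.1f} GiB")  # ~0.6 GiB
```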
## Pre-requisites

We will use Docker Compose to run a ServerlessLLM cluster in this example, so please make sure you have read the Quickstart Guide before proceeding.

## Usage

Start a local Docker-based Ray cluster using Docker Compose.
## Step 1: Set up the Environment

Create a directory for this example and download the `docker-compose.yml` file.

```bash
mkdir sllm-quantization-example && cd sllm-quantization-example
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml
```
## Step 2: Configuration

Create a directory on your host machine where models will be stored, and set the `MODEL_FOLDER` environment variable to point to it:

```bash
export MODEL_FOLDER=/path/to/your/models
```

Replace `/path/to/your/models` with the actual path where you want to store the models.
## Step 3: Start the Services

Start the ServerlessLLM services using Docker Compose:

```bash
docker compose up -d
```

This command will start the Ray head node and two worker nodes defined in the `docker-compose.yml` file.

:::tip
Use the following command to monitor the logs of the head node:

```bash
docker logs -f sllm_head
```
:::
## Step 4: Create Quantization and Deployment Configurations

First, we'll generate a standard Hugging Face `BitsAndBytesConfig` and save it to a JSON file. Then, we'll create a deployment configuration file with these quantization settings embedded in it.

1. Generate the Quantization Config

Create a Python script named `get_config.py` in the current directory with the following content:

```python
# get_config.py
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
quantization_config.to_json_file("quantization_config.json")
```

Run the script to generate `quantization_config.json`:

```bash
python get_config.py
```
2. Create the Deployment Config

Now, create a file named `quantized_deploy_config.json`. This file tells ServerlessLLM which model to deploy and instructs the backend to use the quantization settings. You should copy the contents of `quantization_config.json` into the `quantization_config` field below. A template can be found in `sllm/cli/default_config.json`.

```json
{
  "model": "facebook/opt-1.3b",
  "backend": "transformers",
  "num_gpus": 1,
  "auto_scaling_config": {
    "metric": "concurrency",
    "target": 1,
    "min_instances": 0,
    "max_instances": 10,
    "keep_alive": 0
  },
  "backend_config": {
    "pretrained_model_name_or_path": "",
    "device_map": "auto",
    "torch_dtype": "float16",
    "hf_model_class": "AutoModelForCausalLM",
    "quantization_config": {
      "_load_in_4bit": true,
      "_load_in_8bit": false,
      "bnb_4bit_compute_dtype": "float32",
      "bnb_4bit_quant_storage": "uint8",
      "bnb_4bit_quant_type": "fp4",
      "bnb_4bit_use_double_quant": false,
      "llm_int8_enable_fp32_cpu_offload": false,
      "llm_int8_has_fp16_weight": false,
      "llm_int8_skip_modules": null,
      "llm_int8_threshold": 6.0,
      "load_in_4bit": true,
      "load_in_8bit": false,
      "quant_method": "bitsandbytes"
    }
  }
}
```
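If you prefer not to paste the JSON by hand, a small helper script can merge the generated `quantization_config.json` into the deployment config. This is only a sketch that mirrors the config shown above; the script name `make_deploy_config.py` is arbitrary.

```python
# make_deploy_config.py -- merge quantization_config.json into a deployment config (illustrative sketch).
import json

with open("quantization_config.json") as f:
    quantization_config = json.load(f)

deploy_config = {
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "num_gpus": 1,
    "auto_scaling_config": {
        "metric": "concurrency",
        "target": 1,
        "min_instances": 0,
        "max_instances": 10,
        "keep_alive": 0,
    },
    "backend_config": {
        "pretrained_model_name_or_path": "",
        "device_map": "auto",
        "torch_dtype": "float16",
        "hf_model_class": "AutoModelForCausalLM",
        # Embed the generated quantization settings directly in the backend config.
        "quantization_config": quantization_config,
    },
}

with open("quantized_deploy_config.json", "w") as f:
    json.dump(deploy_config, f, indent=2)
```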
> Note: Quantization currently only supports the "transformers" backend. Support for other backends is coming soon.
## Step 5: Deploy the Quantized Model

With the configuration files in place, deploy the model using `sllm-cli`:

```bash
conda activate sllm
export LLM_SERVER_URL=http://127.0.0.1:8343

sllm-cli deploy --config quantized_deploy_config.json
```
## Step 6: Verify the Deployment

Send an inference request to the server to query the model:

```bash
curl $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-1.3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ]
  }'
```
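Equivalently, you can send the same request from Python. The snippet below is a sketch using the `requests` library (assumed to be installed); the script name `query_model.py` is arbitrary, and it assumes the server is reachable at `LLM_SERVER_URL` (default `http://127.0.0.1:8343`).

```python
# query_model.py -- Python equivalent of the curl request above (illustrative sketch).
import os

import requests

base_url = os.environ.get("LLM_SERVER_URL", "http://127.0.0.1:8343")

response = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "model": "facebook/opt-1.3b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is your name?"},
        ],
    },
    timeout=120,
)
response.raise_for_status()
print(response.json())
```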
You should receive a successful JSON response from the model.

To verify that the model is loaded in the desired precision, check the logs (`docker logs sllm_head`). You should see that the model is indeed loaded in `fp4`:

```log
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:321] load config takes 0.0030286312103271484 seconds
(RoundRobinRouter pid=481) INFO 07-02 20:01:49 roundrobin_router.py:272] []
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:331] load model takes 0.2806234359741211 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:338] device_map: OrderedDict([('', 0)])
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:345] compute_device_placement takes 0.18753838539123535 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:376] allocate_cuda_memory takes 0.0020012855529785156 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 client.py:72] load_into_gpu: transformers/facebook/opt-1.3b, 70b42a05-4faa-4eaf-bb73-512c6453e7fa
(TransformersBackend pid=352, ip=172.18.0.2) INFO 07-02 20:01:49 client.py:113] Model loaded: transformers/facebook/opt-1.3b, 70b42a05-4faa-4eaf-bb73-512c6453e7fa
(TransformersBackend pid=352, ip=172.18.0.2) INFO 07-02 20:01:49 transformers.py:398] restore state_dict takes 0.0007319450378417969 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:411] using precision: fp4
(TransformersBackend pid=352, ip=172.18.0.2) INFO 07-02 20:01:50 client.py:117] confirm_model_loaded: transformers/facebook/opt-1.3b, 70b42a05-4faa-4eaf-bb73-512c6453e7fa
```
## Step 7: Clean Up

Delete the model deployment by running the following command:

```bash
sllm-cli delete facebook/opt-1.3b
```

If you need to stop and remove the containers, you can use the following command:

```bash
docker compose down
```

docs/stable/store/quantization.md

Lines changed: 102 additions & 0 deletions
---
sidebar_position: 2
---

# Quantization

> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
ServerlessLLM currently supports `bitsandbytes` quantization, which reduces model memory usage by converting weights to lower-precision data types. You can configure it by passing a `BitsAndBytesConfig` object when loading a model.

Available precisions include:

- `int8`
- `fp4`
- `nf4`

> Note: CPU offloading and dequantization are not currently supported.
## 8-bit Quantization (`int8`)

8-bit quantization halves the memory usage compared to 16-bit precision with minimal impact on model accuracy. It is a robust and recommended starting point for quantization.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# Load the model with the config
model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
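To see the effect in practice, you can compare the memory footprint of the quantized and unquantized model. The snippet below is a sketch: it uses the standard `get_memory_footprint()` method from `transformers`, and the exact numbers will vary by model and library version.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the same model in fp16 and in 8-bit, then compare weight memory.
# Note: this keeps two copies of the model in memory at once, so make sure
# you have enough headroom, or load and measure them one at a time.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    torch_dtype=torch.float16,
    device_map="auto",
)
model_int8 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# get_memory_footprint() reports parameter and buffer memory in bytes.
print(f"fp16: {model_fp16.get_memory_footprint() / 1024**3:.2f} GiB")
print(f"int8: {model_int8.get_memory_footprint() / 1024**3:.2f} GiB")
```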
## 4-bit Quantization (`fp4`)

FP4 (4-bit Floating Point) quantization offers more aggressive memory savings than 8-bit. It is a good option for running very large models on consumer-grade hardware.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit FP4 quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4"
)

# Load the model with the config
model_fp4 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## 4-bit Quantization (`nf4`)

NF4 (4-bit NormalFloat) is an advanced data type optimized for models whose weights follow a normal distribution. NF4 is generally the recommended 4-bit option as it often yields better model accuracy compared to FP4.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit NF4 quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)

# Load the model with the config
model_nf4 = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## `torch_dtype` (Data Type for Unquantized Layers)

The `torch_dtype` parameter sets the data type for model layers that are not quantized (e.g. `LayerNorm`). Setting this to `torch.float16` or `torch.bfloat16` can further reduce memory usage. If unspecified, these layers default to `torch.float16`.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure 4-bit NF4 quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)

# Load model, casting non-quantized layers to float16
model_mixed_precision = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    device_map="auto",
)
```

For further information, consult the [HuggingFace Documentation for BitsAndBytes](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes).
