---
sidebar_position: 3
---

# Quantization

This example demonstrates how to use quantization within the ServerlessLLM framework to optimize model serving. Quantization reduces the memory footprint and computational requirements of a large language model by representing its weights with lower-precision data types, such as 8-bit integers (int8). This guide shows how to deploy and serve a quantized model in a ServerlessLLM cluster.

## Pre-requisites

We will use Docker Compose to run a ServerlessLLM cluster in this example, so please make sure you have read the Quickstart Guide before proceeding.

## Usage

Start a local Docker-based Ray cluster using Docker Compose.

## Step 1: Set up the Environment

Create a directory for this example and download the `docker-compose.yml` file.

```bash
mkdir sllm-quantization-example && cd sllm-quantization-example
curl -O https://raw.githubusercontent.com/ServerlessLLM/ServerlessLLM/main/examples/docker/docker-compose.yml
```

## Step 2: Configuration

Create a directory on your host machine where models will be stored, and set the `MODEL_FOLDER` environment variable to point to it:

```bash
export MODEL_FOLDER=/path/to/your/models
```

Replace `/path/to/your/models` with the actual path where you want to store the models.
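
For example, to keep the downloaded models in a folder under your home directory (the path below is only an illustration; any writable location works):

```bash
mkdir -p $HOME/sllm-models
export MODEL_FOLDER=$HOME/sllm-models
```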

## Step 3: Start the Services

Start the ServerlessLLM services using Docker Compose:

```bash
docker compose up -d
```

This command will start the Ray head node and two worker nodes defined in the `docker-compose.yml` file.
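
Before moving on, you can confirm that the containers are running (the head container is named `sllm_head` in this example's compose file):

```bash
docker compose ps
```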

:::tip
Use the following command to monitor the logs of the head node:

```bash
docker logs -f sllm_head
```
:::

## Step 4: Create Quantization and Deployment Configurations

First, we'll generate a standard Hugging Face `BitsAndBytesConfig` and save it to a JSON file. Then, we'll create a deployment configuration file with these quantization settings embedded in it.

1. Generate the Quantization Config

Create a Python script named `get_config.py` in the current directory with the following content:

```python
# get_config.py
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
quantization_config.to_json_file("quantization_config.json")
```

Run the script to generate `quantization_config.json`:

```bash
python get_config.py
```

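The script above produces a plain 4-bit (fp4) configuration. `BitsAndBytesConfig` also accepts other standard options, such as 8-bit loading or NF4 with double quantization. The variants below are a sketch using the regular `transformers` parameters; whether a particular variant works end-to-end with your ServerlessLLM version is something to verify from the deployment logs:

```python
# Alternative quantization configs (illustrative; pick one and write it to the same file)
import torch
from transformers import BitsAndBytesConfig

# 8-bit quantization
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit NF4 quantization with double quantization and bfloat16 compute
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

nf4_config.to_json_file("quantization_config.json")
```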

2. Create the Deployment Config

Now, create a file named `quantized_deploy_config.json`. This file tells ServerlessLLM which model to deploy and instructs the backend to use the quantization settings. Copy the contents of `quantization_config.json` into the `quantization_config` field below. A template can be found in `sllm/cli/default_config.json`.

```json
{
    "model": "facebook/opt-1.3b",
    "backend": "transformers",
    "num_gpus": 1,
    "auto_scaling_config": {
        "metric": "concurrency",
        "target": 1,
        "min_instances": 0,
        "max_instances": 10,
        "keep_alive": 0
    },
    "backend_config": {
        "pretrained_model_name_or_path": "",
        "device_map": "auto",
        "torch_dtype": "float16",
        "hf_model_class": "AutoModelForCausalLM",
        "quantization_config": {
            "_load_in_4bit": true,
            "_load_in_8bit": false,
            "bnb_4bit_compute_dtype": "float32",
            "bnb_4bit_quant_storage": "uint8",
            "bnb_4bit_quant_type": "fp4",
            "bnb_4bit_use_double_quant": false,
            "llm_int8_enable_fp32_cpu_offload": false,
            "llm_int8_has_fp16_weight": false,
            "llm_int8_skip_modules": null,
            "llm_int8_threshold": 6.0,
            "load_in_4bit": true,
            "load_in_8bit": false,
            "quant_method": "bitsandbytes"
        }
    }
}
```

> Note: Quantization currently supports only the "transformers" backend. Support for other backends is coming soon.
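
If you prefer not to paste the JSON by hand, you can embed the generated settings programmatically. The snippet below is a minimal sketch (the helper script is illustrative, not part of ServerlessLLM); it assumes `quantized_deploy_config.json` already contains the deployment fields shown above and simply overwrites its `quantization_config` field with the contents of `quantization_config.json`:

```python
# embed_quant_config.py -- illustrative helper script
import json

# Load the BitsAndBytes settings generated by get_config.py
with open("quantization_config.json") as f:
    quant_config = json.load(f)

# Load the deployment config shown above
with open("quantized_deploy_config.json") as f:
    deploy_config = json.load(f)

# Embed the quantization settings into the backend configuration
deploy_config["backend_config"]["quantization_config"] = quant_config

with open("quantized_deploy_config.json", "w") as f:
    json.dump(deploy_config, f, indent=4)
```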

## Step 5: Deploy the Quantized Model

With the configuration files in place, deploy the model using the `sllm-cli`:

```bash
conda activate sllm
export LLM_SERVER_URL=http://127.0.0.1:8343

sllm-cli deploy --config quantized_deploy_config.json
```

## Step 6: Verify the Deployment

Send an inference request to the server to query the model:

```bash
curl $LLM_SERVER_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-1.3b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ]
  }'
```

You should receive a successful JSON response from the model.

To verify that the model is loaded in the desired precision, check the head node logs (`docker logs sllm_head`). You should see that the model is indeed loaded in `fp4`:

```log
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:321] load config takes 0.0030286312103271484 seconds
(RoundRobinRouter pid=481) INFO 07-02 20:01:49 roundrobin_router.py:272] []
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:331] load model takes 0.2806234359741211 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:338] device_map: OrderedDict([('', 0)])
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:345] compute_device_placement takes 0.18753838539123535 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:376] allocate_cuda_memory takes 0.0020012855529785156 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 client.py:72] load_into_gpu: transformers/facebook/opt-1.3b, 70b42a05-4faa-4eaf-bb73-512c6453e7fa
(TransformersBackend pid=352, ip=172.18.0.2) INFO 07-02 20:01:49 client.py:113] Model loaded: transformers/facebook/opt-1.3b, 70b42a05-4faa-4eaf-bb73-512c6453e7fa
(TransformersBackend pid=352, ip=172.18.0.2) INFO 07-02 20:01:49 transformers.py:398] restore state_dict takes 0.0007319450378417969 seconds
(TransformersBackend pid=352, ip=172.18.0.2) DEBUG 07-02 20:01:49 transformers.py:411] using precision: fp4
(TransformersBackend pid=352, ip=172.18.0.2) INFO 07-02 20:01:50 client.py:117] confirm_model_loaded: transformers/facebook/opt-1.3b, 70b42a05-4faa-4eaf-bb73-512c6453e7fa
```
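
You can also query the deployment from Python. The sketch below uses the `openai` client package against the same endpoint; it assumes the server's `/v1/chat/completions` route is OpenAI-compatible, as the curl example above suggests, and that the API key is not validated locally:

```python
# query_quantized_model.py -- illustrative client sketch
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8343/v1",  # same address as LLM_SERVER_URL
    api_key="dummy",                      # placeholder; assumed not to be checked locally
)

response = client.chat.completions.create(
    model="facebook/opt-1.3b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is your name?"},
    ],
)
print(response.choices[0].message.content)
```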

## Step 7: Clean Up

Delete the model deployment by running the following command:

```bash
sllm-cli delete facebook/opt-1.3b
```

If you need to stop and remove the containers, use the following command:

```bash
docker compose down
```