This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit ef0882f

Remove redundant parameters for WOQ saving config and fix GPTQ issue (#1410)
* remove redundant parameters for WOQ

Signed-off-by: changwangss <chang1.wang@intel.com>
1 parent 4b50461 commit ef0882f

File tree: 7 files changed, +106 -47 lines changed


examples/huggingface/pytorch/code-generation/quantization/README.md

Lines changed: 5 additions & 5 deletions
@@ -18,7 +18,7 @@ pip install -r requirements.txt
 ```
 
 # Run
-We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `RTN/AWQ/TEQ` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
+We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
 >**Note**:
 > Model type "llama" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3.
 
@@ -61,13 +61,13 @@ OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python ru
 # load_in_4bit
 OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_4bit True \
+    --load_in_4bit \
     --benchmark \
     --batch_size 1
 # load_in_8bit
 OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_8bit True \
+    --load_in_8bit \
     --benchmark \
     --batch_size 1
 ```
@@ -124,7 +124,7 @@ python run_generation.py \
 # load_in_4bit
 python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_4bit True \
+    --load_in_4bit \
     --accuracy \
     --batch_size 20 \
     --n_samples 20 \
@@ -135,7 +135,7 @@ python run_generation.py \
 # load_in_8bit
 python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_8bit True \
+    --load_in_8bit \
     --accuracy \
     --batch_size 20 \
     --n_samples 20 \
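The `--load_in_4bit True` / `--load_in_8bit True` arguments above lose the literal `True`, which is the usual shape of a boolean `store_true` flag: present means on, absent means off. A minimal, hypothetical argparse sketch (not the actual `run_generation.py` argument definitions) of why the bare-flag form is safer:

```python
import argparse

# Hypothetical parser for illustration only; run_generation.py defines its own
# arguments and may differ in detail.
parser = argparse.ArgumentParser()
parser.add_argument("--load_in_4bit", action="store_true")  # bare on/off switch

print(parser.parse_args(["--load_in_4bit"]).load_in_4bit)  # True
print(parser.parse_args([]).load_in_4bit)                  # False

# A store_true flag does not consume a value, so the old "--load_in_4bit True"
# form would leave a stray "True" token behind:
#   parser.parse_args(["--load_in_4bit", "True"])  # error: unrecognized arguments: True
```

For comparison, defining the option with `type=bool` would turn any non-empty string, even `"False"`, into `True`, which is a common reason command-line booleans are switched to bare flags.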

examples/huggingface/pytorch/text-generation/quantization/README.md

Lines changed: 6 additions & 5 deletions
@@ -2,9 +2,10 @@
 We provide the inference benchmarking script `run_generation.py` for large language models, The following are the models we validated, more models are working in progress.
 
 # Quantization for CPU device
+
 >**Note**:
 > 1. default search algorithm is beam search with num_beams = 4.
-> 2. Model type "gptj", "opt", "llama" and "falcon" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3 with use_neural_speed=False.
+> 2. Model type "gptj", "opt", "llama" and "falcon" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3 for SmoothQuant.
 ## Prerequisite​
 ### Create Environment​
 Pytorch and Intel-extension-for-pytorch version 2.1 are required, python version requests equal or higher than 3.9 due to [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation, the dependent packages are listed in requirements, we recommend create environment as the following steps.
@@ -21,7 +22,7 @@ pip install -r requirements.txt
 
 
 ## Run
-We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `RTN/AWQ/TEQ` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, and also support `PEFT` optimized model compression, the followings are command to show how to use it.
+We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, and also support `PEFT` optimized model compression, the followings are command to show how to use it.
 
 ### 1. Performance
 ``` bash
@@ -108,13 +109,13 @@ python run_generation.py \
 # load_in_4bit
 python run_generation.py \
     --model EleutherAI/gpt-j-6b \
-    --load_in_4bit True \
+    --load_in_4bit \
     --accuracy \
     --tasks "lambada_openai"
 # load_in_8bit
 python run_generation.py \
     --model EleutherAI/gpt-j-6b \
-    --load_in_8bit True \
+    --load_in_8bit \
     --accuracy \
     --tasks "lambada_openai"
 # restore the model optimized with smoothquant
@@ -128,7 +129,7 @@ python run_generation.py \
 
 ```
 
-# # Weight Only Quantization for GPU device
+# Weight Only Quantization for GPU device
 >**Note**:
 > 1. default search algorithm is beam search with num_beams = 1.
 > 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) Support for the optimized inference of model types "gptj," "mistral," "qwen," and "llama" to achieve high performance and accuracy. Ensure accurate inference for other model types as well.

examples/huggingface/pytorch/text-generation/quantization/run_generation.py

Lines changed: 1 addition & 0 deletions
@@ -423,6 +423,7 @@
         args.model,
         trust_remote_code=args.trust_remote_code,
         _commit_hash=args._commit_hash,
+        use_neural_speed=args.use_neural_speed,
     )
 
     # save model
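The one-line addition threads the script's `use_neural_speed` flag through to the loading call so the same backend choice is applied consistently. A hedged sketch of that calling pattern; the argument set and exact call site are assumptions rather than the script's real code, although `use_neural_speed` is handled inside the ITREX `from_pretrained` path per the modeling_auto.py change further down:

```python
import argparse

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="EleutherAI/gpt-j-6b")  # illustrative default
parser.add_argument("--trust_remote_code", action="store_true")
parser.add_argument("--use_neural_speed", action="store_true")
args = parser.parse_args()

# Pass the flag explicitly instead of relying on the library default, which is
# what the added line in the diff above does.
model = AutoModelForCausalLM.from_pretrained(
    args.model,
    trust_remote_code=args.trust_remote_code,
    use_neural_speed=args.use_neural_speed,
)
```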

intel_extension_for_transformers/transformers/llm/quantization/nn/modules.py

Lines changed: 43 additions & 14 deletions
@@ -23,7 +23,9 @@
 from peft.peft_model import PEFT_TYPE_TO_MODEL_MAPPING, PeftType
 from peft.tuners.lora import LoraLayer, LoraModel
 from peft.utils.other import transpose
-from intel_extension_for_transformers.transformers.llm.quantization.autograd import matmul_kbit
+from intel_extension_for_transformers.transformers.llm.quantization.autograd import (
+    matmul_kbit,
+)
 import intel_extension_for_transformers.qbits as qbits  # pylint: disable=E0611, E0401
 
 
@@ -180,15 +182,13 @@ def set_weights_bias(
                     group_dict[group_idx] = 0
                 else:
                     group_dict[group_idx] = group_dict[group_idx] + 1
-                    target_idx = group_idx * \
-                        group_size + group_dict[group_idx]
+                    target_idx = group_idx * group_size + group_dict[group_idx]
                 int_weight2[target_idx] = int_weight[i]
             int_weight = int_weight2
         else:
             g_idx = torch.empty(0, dtype=torch.int32)
     else:
         g_idx = torch.empty(0, dtype=torch.int32)
-
     if q_config.bits == 4:
         int_weight = (int_weight - 8) * 16
         gptq_scales = gptq_scales / 16
@@ -251,38 +251,65 @@ def quant_weight_w_scale(self, weight, scale, zp, group_size=-1):
         leng = weight.shape[1] // group_size
         tail_flag = False if weight.shape[1] % group_size == 0 else True
         for i in range(leng):
-            int_weight_tmp = weight[:, i * group_size: (i + 1) * group_size].div_(
+            int_weight_tmp = weight[:, i * group_size : (i + 1) * group_size].div_(
                 scale[:, i].unsqueeze(1)
             )
             if zp is not None:
                 int_weight_tmp.add_(zp[:, i].unsqueeze(1))
-            int_weight[:, i * group_size: (i + 1) * group_size].copy_(
+            int_weight[:, i * group_size : (i + 1) * group_size].copy_(
                 int_weight_tmp.round_()
             )
         if tail_flag:
-            int_weight_tmp = weight[:, leng * group_size:].div_(
+            int_weight_tmp = weight[:, leng * group_size :].div_(
                 scale[:, -1].unsqueeze(1)
             )
             if zp is not None:
                 int_weight_tmp.add_(zp[:, -1].unsqueeze(1))
-            int_weight[:, leng * group_size:].copy_(int_weight_tmp.round_())
+            int_weight[:, leng * group_size :].copy_(int_weight_tmp.round_())
         return int_weight
 
     def recover_qparms(self):
+        def recover_idx(ret_idx, k, blocksize):
+            g_idx = torch.zeros(k, dtype=int)
+            value_range = (k + blocksize - 1) // blocksize
+            for i in range(value_range):
+                for j in range(blocksize):
+                    g_idx[ret_idx[i * blocksize + j]] = i
+            return g_idx
+
+        def recover_int_weight(g_idx, int_weight):
+            group_dict = {}
+            ret_idx = torch.zeros(g_idx.shape, dtype=torch.int32)
+            for i in range(len(g_idx)):
+                group_idx = g_idx[i].item()
+                if group_idx not in group_dict:
+                    target_idx = group_idx * group_size
+                    group_dict[group_idx] = 0
+                else:
+                    group_dict[group_idx] = group_dict[group_idx] + 1
+                    target_idx = group_idx * group_size + group_dict[group_idx]
+                ret_idx[i] = target_idx
+
+            int_weight2 = int_weight.clone().zero_()
+            for i in range(len(ret_idx)):
+                int_weight2[i] = int_weight[ret_idx[i]]
+            int_weight = int_weight2
+            return int_weight
+
         group_size = qbits.acquire_packed_weight_info(self.weight, 1)[0]
         in_features = qbits.acquire_packed_weight_info(self.weight, 2)[0]
         out_features = qbits.acquire_packed_weight_info(self.weight, 3)[0]
         desc_act = qbits.acquire_packed_weight_info(self.weight, 4)[0] != 0
         if desc_act:
             g_idx = qbits.acquire_packed_weight_info(self.weight, 5)
+            g_idx = recover_idx(g_idx, in_features, group_size)
         else:
             g_idx = None
         weight_dtype_ascii = qbits.acquire_packed_weight_info(self.weight, 6)
         weight_dtype = "".join(
             chr(ascii_code) for ascii_code in weight_dtype_ascii.tolist()
         )
-        bits = 4 if weight_dtype in [
-            "nf4", "int4_clip", "fp4", "int4_fullrange"] else 8
+        bits = 4 if weight_dtype in ["nf4", "int4_clip", "fp4", "int4_fullrange"] else 8
         compute_dtype_ascii = qbits.acquire_packed_weight_info(self.weight, 7)
         compute_dtype = "".join(
             chr(ascii_code) for ascii_code in compute_dtype_ascii.tolist()
@@ -319,6 +346,10 @@ def recover_qparms(self):
             group_size=group_size,
         )
 
+        if g_idx is not None:
+            int_weight = recover_int_weight(g_idx, int_weight.t())
+            int_weight = int_weight.t()
+
         scales_dtype = torch.float32 if scales_dtype in ["fp32"] else None
         return (
             group_size,
@@ -361,15 +392,13 @@ def __init__(
             scheme=kwargs.get("scheme", "sym"),
             device=kwargs.get("device", None),
         )
-        LoraLayer.__init__(self, in_features=in_features,
-                           out_features=out_features)
+        LoraLayer.__init__(self, in_features=in_features, out_features=out_features)
 
         # Freezing the pre-trained weight matrix
         self.weight.requires_grad = False
 
         init_lora_weights = kwargs.pop("init_lora_weights", True)
-        self.update_layer(adapter_name, r, lora_alpha,
-                          lora_dropout, init_lora_weights)
+        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
         qbits_customop_available = True
         try:
             qbits.dropout_fwd
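The new `recover_idx` and `recover_int_weight` helpers undo the act-order (desc_act) reshuffling that group-sorted packing applies to the weight rows. Below is a self-contained round-trip sketch of the same logic; the helpers are restated with `group_size` passed explicitly (in the diff it is captured from the enclosing method), and the toy tensors are illustrative only:

```python
import torch

# Standalone copies of the two helpers added to recover_qparms() above, with
# group_size as an explicit parameter instead of a closure variable.
def recover_idx(ret_idx, k, blocksize):
    # ret_idx[p] is the original row stored at packed position p; rebuild the
    # per-row group id (g_idx) in original row order.
    g_idx = torch.zeros(k, dtype=int)
    value_range = (k + blocksize - 1) // blocksize
    for i in range(value_range):
        for j in range(blocksize):
            g_idx[ret_idx[i * blocksize + j]] = i
    return g_idx

def recover_int_weight(g_idx, int_weight, group_size):
    # Recompute where each original row landed during group-sorted packing,
    # then gather so that row i of the result is the original row i again.
    group_dict = {}
    ret_idx = torch.zeros(g_idx.shape, dtype=torch.int32)
    for i in range(len(g_idx)):
        group_idx = g_idx[i].item()
        if group_idx not in group_dict:
            target_idx = group_idx * group_size
            group_dict[group_idx] = 0
        else:
            group_dict[group_idx] = group_dict[group_idx] + 1
            target_idx = group_idx * group_size + group_dict[group_idx]
        ret_idx[i] = target_idx
    int_weight2 = int_weight.clone().zero_()
    for i in range(len(ret_idx)):
        int_weight2[i] = int_weight[ret_idx[i]]
    return int_weight2

# Toy round trip: 4 rows, group_size 2, rows 0 and 2 quantized in group 0,
# rows 1 and 3 in group 1 (within-group order follows the original index,
# matching the packing loop shown in set_weights_bias above).
ret_idx = torch.tensor([0, 2, 1, 3])
original = torch.arange(8).reshape(4, 2)
packed = original[ret_idx]                      # group-sorted storage layout
g_idx = recover_idx(ret_idx, k=4, blocksize=2)  # tensor([0, 1, 0, 1])
restored = recover_int_weight(g_idx, packed, group_size=2)
assert torch.equal(restored, original)
```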

intel_extension_for_transformers/transformers/llm/quantization/utils.py

Lines changed: 21 additions & 9 deletions
@@ -26,7 +26,10 @@
 from neural_compressor.adaptor.torch_utils.model_wrapper import WeightOnlyLinear
 from neural_compressor.utils.utility import LazyImport
 from neural_compressor.config import PostTrainingQuantConfig
-from intel_extension_for_transformers.tools.utils import is_ipex_available, is_autoround_available
+from intel_extension_for_transformers.tools.utils import (
+    is_ipex_available,
+    is_autoround_available,
+)
 from transformers import AutoTokenizer
 
 if is_ipex_available():
@@ -71,7 +74,7 @@ def unpack_weight(qweight, scales, qzeros, q_config):
     except:
         # zeros and scales have different iteam numbers.
         # remove 1 (due to 0 + 1 in line 68)
-        zeros = zeros[zeros !=1]
+        zeros = zeros[zeros != 1]
         zeros = zeros.reshape(scales.shape)
 
     # due to INC asym return torch.uint8 but backend request int8,
@@ -92,7 +95,7 @@ def unpack_weight(qweight, scales, qzeros, q_config):
     # due to INC asym return torch.uint8 but backend request int8,
     # change it to int8 with offset 128
     if not sym:
-        weight = (weight.to(torch.int32) - 128). to(torch.int8)
+        weight = (weight.to(torch.int32) - 128).to(torch.int8)
     return weight, scales, zeros
 
 
@@ -258,8 +261,13 @@ def _replace_linear(
                 # Force requires grad to False to avoid unexpected errors
                 model._modules[name].requires_grad_(False)
                 if device == "cpu" or device == torch.device("cpu") or device == "auto":
-                    if quantization_config.weight_dtype in \
-                        ["fp8_e5m2", "fp8_e4m3", "nf4", "fp4", "int4_fullrange"]:
+                    if quantization_config.weight_dtype in [
+                        "fp8_e5m2",
+                        "fp8_e4m3",
+                        "nf4",
+                        "fp4",
+                        "int4_fullrange",
+                    ]:
                         model._modules[name].set_fp_weights_bias(
                             module.weight.data,
                             None if module.bias is None else module.bias.data,
@@ -324,13 +332,17 @@ def convert_to_quantized_model(model, config, device="cpu"):
     calib_dataloader = config.calib_dataloader
     calib_func = config.calib_func
    calib_iters = config.calib_iters
+    calib_dataset = config.dataset
     model_device = next(model.parameters()).device
 
-    if calib_dataloader is None and config.quant_method.value not in ["rtn"]:
+    if (
+        calib_dataloader is None
+        and config.quant_method.value not in ["rtn"]
+        and calib_dataset is not None
+    ):
         from datasets import load_dataset
         from torch.utils.data import DataLoader
 
-        calib_dataset = config.calib_dataset
         if isinstance(calib_dataset, (str, bytes, os.PathLike)):
             calib_dataset = load_dataset(calib_dataset, split="train")
         calib_dataset = calib_dataset.shuffle(seed=42)
@@ -442,7 +454,7 @@ def default_calib_func(model):
                     True if "fullrange" in config.weight_dtype else False
                 ),
                 "enable_mse_search": config.mse_range,
-            }
+            },
         }
         algorithm = "RTN"
     elif config.quant_method.value == "awq":
@@ -470,7 +482,7 @@ def default_calib_func(model):
                 "use_max_length": True if config.max_input_length else False,
                 "pad_max_length": config.max_input_length,
                 "static_groups": config.static_groups,
-            }
+            },
         }
         algorithm = "GPTQ"
     elif config.quant_method.value == "autoround":
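The reworked calibration branch in this diff only builds a dataloader when a calibration dataset is actually configured and the chosen algorithm needs one. A minimal sketch of that dataset-to-dataloader pattern outside the library; the dataset name, tokenizer, text column, and iteration count are assumptions for illustration, not values the library hard-codes:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

calib_dataset_name = "NeelNanda/pile-10k"                         # assumed calibration corpus
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")  # assumed tokenizer
calib_iters = 8                                                   # stand-in for config.calib_iters

dataset = load_dataset(calib_dataset_name, split="train").shuffle(seed=42)

def collate(batch):
    # One example per step; assumes the dataset has a "text" column, and
    # truncation keeps calibration cheap.
    return tokenizer(batch[0]["text"], return_tensors="pt", truncation=True, max_length=512)

calib_dataloader = DataLoader(dataset, batch_size=1, collate_fn=collate)

for step, inputs in enumerate(calib_dataloader):
    if step >= calib_iters:
        break
    # A calibration function would run model(**inputs) here to collect statistics.
    print(step, inputs["input_ids"].shape)
```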

intel_extension_for_transformers/transformers/modeling/modeling_auto.py

Lines changed: 14 additions & 5 deletions
@@ -127,6 +127,8 @@ def recover_export_model(model, current_key_name=None):
             model._modules[name].pack(
                 int_weight, scales, zeros, module.bias, g_idx=g_idx
             )
+            if g_idx is not None:
+                model._modules[name].g_idx = g_idx
 
         if len(list(module.children())) > 0:  # pylint: disable=E1101
             _ = recover_export_model(module, current_key_name)
@@ -179,8 +181,13 @@ def convert_model_to_public(model):
             module.qweight.data = module.qweight.t_().contiguous()
             module.scales.data = module.scales.t_().contiguous()
             module.weight_transposed = False
-    elif model.quantization_config.weight_dtype not in \
-        ["fp8_e5m2", "fp8_e4m3", "nf4", "fp4", "int4_fullrange"]:
+    elif model.quantization_config.weight_dtype not in [
+        "fp8_e5m2",
+        "fp8_e4m3",
+        "nf4",
+        "fp4",
+        "int4_fullrange",
+    ]:
         model = recover_export_model(model)
 
 
@@ -368,8 +375,10 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
                 exit(0)
 
             if config.model_type in cls.model_type_list and not use_xpu:
-                if isinstance(quantization_config,
-                              GPTQConfig) and config.model_type not in cls.model_type_list_for_gptq:
+                if (
+                    isinstance(quantization_config, GPTQConfig)
+                    and config.model_type not in cls.model_type_list_for_gptq
+                ):
                     use_neural_speed = False
                 else:
                     use_neural_speed = True
@@ -609,7 +618,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
             model = convert_to_quantized_model(
                 model, quantization_config, device=device_map
            )
-            quantization_config.tokenizer = None
+            quantization_config.remove_redundant_parameters()
            model.config.quantization_config = quantization_config
 
             # add quantization_config and save_low_bit to pretrained model dynamically
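Replacing `quantization_config.tokenizer = None` with `quantization_config.remove_redundant_parameters()` is what the commit title refers to: the saved WOQ config should not carry calibration-time objects. The diff does not show the method body, so here is a plausible sketch of such a helper; the class and its attribute list are assumptions for illustration, not the library's actual implementation:

```python
class WOQConfigSketch:
    """Illustrative stand-in for a weight-only quantization config; the real
    class and its attribute names live in ITREX and may differ."""

    def __init__(self, bits=4, tokenizer=None, calib_func=None,
                 calib_dataloader=None, calib_iters=100, dataset=None):
        self.bits = bits                      # needed again at load time: keep
        self.tokenizer = tokenizer            # calibration-only: safe to drop
        self.calib_func = calib_func
        self.calib_dataloader = calib_dataloader
        self.calib_iters = calib_iters
        self.dataset = dataset

    def remove_redundant_parameters(self):
        # Null out fields that were only needed while quantizing, so the config
        # attached to model.config (and later serialized with the model) does
        # not drag along tokenizers, dataloaders, or callables.
        for attr in ("tokenizer", "calib_func", "calib_dataloader",
                     "calib_iters", "dataset"):
            setattr(self, attr, None)
```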
