This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit ef0882f

Remove redundant parameters for WOQ saving config and fix GPTQ issue (#1410)
* remove redundant parameters for WOQ

Signed-off-by: changwangss <chang1.wang@intel.com>
1 parent 4b50461 commit ef0882f

File tree: 7 files changed, +106 -47 lines changed


examples/huggingface/pytorch/code-generation/quantization/README.md

Lines changed: 5 additions & 5 deletions
@@ -18,7 +18,7 @@ pip install -r requirements.txt
 ```
 
 # Run
-We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `RTN/AWQ/TEQ` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
+We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, the followings are command to show how to use it.
 >**Note**:
 > Model type "llama" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3.
 
@@ -61,13 +61,13 @@ OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python ru
 # load_in_4bit
 OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_4bit True \
+    --load_in_4bit \
     --benchmark \
     --batch_size 1
 # load_in_8bit
 OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_8bit True \
+    --load_in_8bit \
     --benchmark \
     --batch_size 1
 ```
@@ -124,7 +124,7 @@ python run_generation.py \
 # load_in_4bit
 python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_4bit True \
+    --load_in_4bit \
     --accuracy \
     --batch_size 20 \
     --n_samples 20 \
@@ -135,7 +135,7 @@ python run_generation.py \
 # load_in_8bit
 python run_generation.py \
     --model bigcode/starcoder \
-    --load_in_8bit True \
+    --load_in_8bit \
     --accuracy \
     --batch_size 20 \
     --n_samples 20 \
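The `--load_in_4bit True` / `--load_in_8bit True` arguments above lose the literal `True`, which is the usual shape of a boolean `store_true` flag: present means on, absent means off. A minimal, hypothetical argparse sketch (not the actual `run_generation.py` argument definitions) of why the bare-flag form is safer:

```python
import argparse

# Hypothetical parser for illustration only; run_generation.py defines its own
# arguments and may differ in detail.
parser = argparse.ArgumentParser()
parser.add_argument("--load_in_4bit", action="store_true")  # bare on/off switch

print(parser.parse_args(["--load_in_4bit"]).load_in_4bit)  # True
print(parser.parse_args([]).load_in_4bit)                  # False

# A store_true flag does not consume a value, so the old "--load_in_4bit True"
# form would leave a stray "True" token behind:
#   parser.parse_args(["--load_in_4bit", "True"])  # error: unrecognized arguments: True
```

For comparison, defining the option with `type=bool` would turn any non-empty string, even `"False"`, into `True`, which is a common reason command-line booleans are switched to bare flags.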

examples/huggingface/pytorch/text-generation/quantization/README.md

Lines changed: 6 additions & 5 deletions
@@ -2,9 +2,10 @@
 We provide the inference benchmarking script `run_generation.py` for large language models, The following are the models we validated, more models are working in progress.
 
 # Quantization for CPU device
+
 >**Note**:
 > 1. default search algorithm is beam search with num_beams = 4.
-> 2. Model type "gptj", "opt", "llama" and "falcon" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3 with use_neural_speed=False.
+> 2. Model type "gptj", "opt", "llama" and "falcon" will default use [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/339bd251841e153ad9c34e1033ab8b2d936a1781/docs/tutorials/llm/llm_optimize_transformers.md) to accelerate the inference, but "llama" requests transformers version lower than 4.36.0, "falcon" requests transformers version lower than 4.33.3 for SmoothQuant.
 ## Prerequisite​
 ### Create Environment​
 Pytorch and Intel-extension-for-pytorch version 2.1 are required, python version requests equal or higher than 3.9 due to [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation, the dependent packages are listed in requirements, we recommend create environment as the following steps.
@@ -21,7 +22,7 @@ pip install -r requirements.txt
 
 
 ## Run
-We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `RTN/AWQ/TEQ` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, and also support `PEFT` optimized model compression, the followings are command to show how to use it.
+We provide compression technologies such as `MixedPrecision`, `SmoothQuant` and `WeightOnlyQuant` with `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms and `BitsandBytes`, `load_in_4bit` and `load_in_8bit` work on CPU device, and also support `PEFT` optimized model compression, the followings are command to show how to use it.
 
 ### 1. Performance
 ``` bash
@@ -108,13 +109,13 @@ python run_generation.py \
 # load_in_4bit
 python run_generation.py \
     --model EleutherAI/gpt-j-6b \
-    --load_in_4bit True \
+    --load_in_4bit \
     --accuracy \
     --tasks "lambada_openai"
 # load_in_8bit
 python run_generation.py \
     --model EleutherAI/gpt-j-6b \
-    --load_in_8bit True \
+    --load_in_8bit \
     --accuracy \
     --tasks "lambada_openai"
 # restore the model optimized with smoothquant
@@ -128,7 +129,7 @@ python run_generation.py \
 
 ```
 
-# # Weight Only Quantization for GPU device
+# Weight Only Quantization for GPU device
 >**Note**:
 > 1. default search algorithm is beam search with num_beams = 1.
 > 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) Support for the optimized inference of model types "gptj," "mistral," "qwen," and "llama" to achieve high performance and accuracy. Ensure accurate inference for other model types as well.

examples/huggingface/pytorch/text-generation/quantization/run_generation.py

Lines changed: 1 addition & 0 deletions
@@ -423,6 +423,7 @@
         args.model,
         trust_remote_code=args.trust_remote_code,
         _commit_hash=args._commit_hash,
+        use_neural_speed=args.use_neural_speed,
     )
 
     # save model
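The one-line addition threads the script's `use_neural_speed` flag through to the loading call so the same backend choice is applied consistently. A hedged sketch of that calling pattern; the argument set and exact call site are assumptions rather than the script's real code, although `use_neural_speed` is handled inside the ITREX `from_pretrained` path per the modeling_auto.py change further down:

```python
import argparse

from intel_extension_for_transformers.transformers import AutoModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="EleutherAI/gpt-j-6b")  # illustrative default
parser.add_argument("--trust_remote_code", action="store_true")
parser.add_argument("--use_neural_speed", action="store_true")
args = parser.parse_args()

# Pass the flag explicitly instead of relying on the library default, which is
# what the added line in the diff above does.
model = AutoModelForCausalLM.from_pretrained(
    args.model,
    trust_remote_code=args.trust_remote_code,
    use_neural_speed=args.use_neural_speed,
)
```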

intel_extension_for_transformers/transformers/llm/quantization/nn/modules.py

Lines changed: 43 additions & 14 deletions
@@ -23,7 +23,9 @@
 from peft.peft_model import PEFT_TYPE_TO_MODEL_MAPPING, PeftType
 from peft.tuners.lora import LoraLayer, LoraModel
 from peft.utils.other import transpose
-from intel_extension_for_transformers.transformers.llm.quantization.autograd import matmul_kbit
+from intel_extension_for_transformers.transformers.llm.quantization.autograd import (
+    matmul_kbit,
+)
 import intel_extension_for_transformers.qbits as qbits  # pylint: disable=E0611, E0401
 
 
@@ -180,15 +182,13 @@ def set_weights_bias(
                     group_dict[group_idx] = 0
                 else:
                     group_dict[group_idx] = group_dict[group_idx] + 1
-                    target_idx = group_idx * \
-                        group_size + group_dict[group_idx]
+                    target_idx = group_idx * group_size + group_dict[group_idx]
                 int_weight2[target_idx] = int_weight[i]
             int_weight = int_weight2
         else:
             g_idx = torch.empty(0, dtype=torch.int32)
     else:
         g_idx = torch.empty(0, dtype=torch.int32)
-
     if q_config.bits == 4:
         int_weight = (int_weight - 8) * 16
         gptq_scales = gptq_scales / 16
@@ -251,38 +251,65 @@ def quant_weight_w_scale(self, weight, scale, zp, group_size=-1):
         leng = weight.shape[1] // group_size
         tail_flag = False if weight.shape[1] % group_size == 0 else True
         for i in range(leng):
-            int_weight_tmp = weight[:, i * group_size: (i + 1) * group_size].div_(
+            int_weight_tmp = weight[:, i * group_size : (i + 1) * group_size].div_(
                 scale[:, i].unsqueeze(1)
             )
             if zp is not None:
                 int_weight_tmp.add_(zp[:, i].unsqueeze(1))
-            int_weight[:, i * group_size: (i + 1) * group_size].copy_(
+            int_weight[:, i * group_size : (i + 1) * group_size].copy_(
                 int_weight_tmp.round_()
             )
         if tail_flag:
-            int_weight_tmp = weight[:, leng * group_size:].div_(
+            int_weight_tmp = weight[:, leng * group_size :].div_(
                 scale[:, -1].unsqueeze(1)
             )
             if zp is not None:
                 int_weight_tmp.add_(zp[:, -1].unsqueeze(1))
-            int_weight[:, leng * group_size:].copy_(int_weight_tmp.round_())
+            int_weight[:, leng * group_size :].copy_(int_weight_tmp.round_())
         return int_weight
 
     def recover_qparms(self):
+        def recover_idx(ret_idx, k, blocksize):
+            g_idx = torch.zeros(k, dtype=int)
+            value_range = (k + blocksize - 1) // blocksize
+            for i in range(value_range):
+                for j in range(blocksize):
+                    g_idx[ret_idx[i * blocksize + j]] = i
+            return g_idx
+
+        def recover_int_weight(g_idx, int_weight):
+            group_dict = {}
+            ret_idx = torch.zeros(g_idx.shape, dtype=torch.int32)
+            for i in range(len(g_idx)):
+                group_idx = g_idx[i].item()
+                if group_idx not in group_dict:
+                    target_idx = group_idx * group_size
+                    group_dict[group_idx] = 0
+                else:
+                    group_dict[group_idx] = group_dict[group_idx] + 1
+                    target_idx = group_idx * group_size + group_dict[group_idx]
+                ret_idx[i] = target_idx
+
+            int_weight2 = int_weight.clone().zero_()
+            for i in range(len(ret_idx)):
+                int_weight2[i] = int_weight[ret_idx[i]]
+            int_weight = int_weight2
+            return int_weight
+
         group_size = qbits.acquire_packed_weight_info(self.weight, 1)[0]
         in_features = qbits.acquire_packed_weight_info(self.weight, 2)[0]
         out_features = qbits.acquire_packed_weight_info(self.weight, 3)[0]
         desc_act = qbits.acquire_packed_weight_info(self.weight, 4)[0] != 0
         if desc_act:
             g_idx = qbits.acquire_packed_weight_info(self.weight, 5)
+            g_idx = recover_idx(g_idx, in_features, group_size)
         else:
             g_idx = None
         weight_dtype_ascii = qbits.acquire_packed_weight_info(self.weight, 6)
         weight_dtype = "".join(
             chr(ascii_code) for ascii_code in weight_dtype_ascii.tolist()
         )
-        bits = 4 if weight_dtype in [
-            "nf4", "int4_clip", "fp4", "int4_fullrange"] else 8
+        bits = 4 if weight_dtype in ["nf4", "int4_clip", "fp4", "int4_fullrange"] else 8
         compute_dtype_ascii = qbits.acquire_packed_weight_info(self.weight, 7)
         compute_dtype = "".join(
             chr(ascii_code) for ascii_code in compute_dtype_ascii.tolist()
@@ -319,6 +346,10 @@ def recover_qparms(self):
             group_size=group_size,
         )
 
+        if g_idx is not None:
+            int_weight = recover_int_weight(g_idx, int_weight.t())
+            int_weight = int_weight.t()
+
         scales_dtype = torch.float32 if scales_dtype in ["fp32"] else None
         return (
             group_size,
@@ -361,15 +392,13 @@ def __init__(
             scheme=kwargs.get("scheme", "sym"),
             device=kwargs.get("device", None),
         )
-        LoraLayer.__init__(self, in_features=in_features,
-                           out_features=out_features)
+        LoraLayer.__init__(self, in_features=in_features, out_features=out_features)
 
         # Freezing the pre-trained weight matrix
         self.weight.requires_grad = False
 
         init_lora_weights = kwargs.pop("init_lora_weights", True)
-        self.update_layer(adapter_name, r, lora_alpha,
-                          lora_dropout, init_lora_weights)
+        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
         qbits_customop_available = True
         try:
             qbits.dropout_fwd
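The new `recover_idx` and `recover_int_weight` helpers undo the act-order (desc_act) reshuffling that group-sorted packing applies to the weight rows. Below is a self-contained round-trip sketch of the same logic; the helpers are restated with `group_size` passed explicitly (in the diff it is captured from the enclosing method), and the toy tensors are illustrative only:

```python
import torch

# Standalone copies of the two helpers added to recover_qparms() above, with
# group_size as an explicit parameter instead of a closure variable.
def recover_idx(ret_idx, k, blocksize):
    # ret_idx[p] is the original row stored at packed position p; rebuild the
    # per-row group id (g_idx) in original row order.
    g_idx = torch.zeros(k, dtype=int)
    value_range = (k + blocksize - 1) // blocksize
    for i in range(value_range):
        for j in range(blocksize):
            g_idx[ret_idx[i * blocksize + j]] = i
    return g_idx

def recover_int_weight(g_idx, int_weight, group_size):
    # Recompute where each original row landed during group-sorted packing,
    # then gather so that row i of the result is the original row i again.
    group_dict = {}
    ret_idx = torch.zeros(g_idx.shape, dtype=torch.int32)
    for i in range(len(g_idx)):
        group_idx = g_idx[i].item()
        if group_idx not in group_dict:
            target_idx = group_idx * group_size
            group_dict[group_idx] = 0
        else:
            group_dict[group_idx] = group_dict[group_idx] + 1
            target_idx = group_idx * group_size + group_dict[group_idx]
        ret_idx[i] = target_idx
    int_weight2 = int_weight.clone().zero_()
    for i in range(len(ret_idx)):
        int_weight2[i] = int_weight[ret_idx[i]]
    return int_weight2

# Toy round trip: 4 rows, group_size 2, rows 0 and 2 quantized in group 0,
# rows 1 and 3 in group 1 (within-group order follows the original index,
# matching the packing loop shown in set_weights_bias above).
ret_idx = torch.tensor([0, 2, 1, 3])
original = torch.arange(8).reshape(4, 2)
packed = original[ret_idx]                      # group-sorted storage layout
g_idx = recover_idx(ret_idx, k=4, blocksize=2)  # tensor([0, 1, 0, 1])
restored = recover_int_weight(g_idx, packed, group_size=2)
assert torch.equal(restored, original)
```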

intel_extension_for_transformers/transformers/llm/quantization/utils.py

Lines changed: 21 additions & 9 deletions
@@ -26,7 +26,10 @@
 from neural_compressor.adaptor.torch_utils.model_wrapper import WeightOnlyLinear
 from neural_compressor.utils.utility import LazyImport
 from neural_compressor.config import PostTrainingQuantConfig
-from intel_extension_for_transformers.tools.utils import is_ipex_available, is_autoround_available
+from intel_extension_for_transformers.tools.utils import (
+    is_ipex_available,
+    is_autoround_available,
+)
 from transformers import AutoTokenizer
 
 if is_ipex_available():
@@ -71,7 +74,7 @@ def unpack_weight(qweight, scales, qzeros, q_config):
     except:
         # zeros and scales have different iteam numbers.
         # remove 1 (due to 0 + 1 in line 68)
-        zeros = zeros[zeros !=1]
+        zeros = zeros[zeros != 1]
         zeros = zeros.reshape(scales.shape)
 
     # due to INC asym return torch.uint8 but backend request int8,
@@ -92,7 +95,7 @@ def unpack_weight(qweight, scales, qzeros, q_config):
     # due to INC asym return torch.uint8 but backend request int8,
     # change it to int8 with offset 128
     if not sym:
-        weight = (weight.to(torch.int32) - 128). to(torch.int8)
+        weight = (weight.to(torch.int32) - 128).to(torch.int8)
     return weight, scales, zeros
 
 
@@ -258,8 +261,13 @@ def _replace_linear(
                 # Force requires grad to False to avoid unexpected errors
                 model._modules[name].requires_grad_(False)
                 if device == "cpu" or device == torch.device("cpu") or device == "auto":
-                    if quantization_config.weight_dtype in \
-                        ["fp8_e5m2", "fp8_e4m3", "nf4", "fp4", "int4_fullrange"]:
+                    if quantization_config.weight_dtype in [
+                        "fp8_e5m2",
+                        "fp8_e4m3",
+                        "nf4",
+                        "fp4",
+                        "int4_fullrange",
+                    ]:
                         model._modules[name].set_fp_weights_bias(
                             module.weight.data,
                             None if module.bias is None else module.bias.data,
@@ -324,13 +332,17 @@ def convert_to_quantized_model(model, config, device="cpu"):
     calib_dataloader = config.calib_dataloader
     calib_func = config.calib_func
    calib_iters = config.calib_iters
+    calib_dataset = config.dataset
     model_device = next(model.parameters()).device
 
-    if calib_dataloader is None and config.quant_method.value not in ["rtn"]:
+    if (
+        calib_dataloader is None
+        and config.quant_method.value not in ["rtn"]
+        and calib_dataset is not None
+    ):
         from datasets import load_dataset
         from torch.utils.data import DataLoader
 
-        calib_dataset = config.calib_dataset
         if isinstance(calib_dataset, (str, bytes, os.PathLike)):
             calib_dataset = load_dataset(calib_dataset, split="train")
         calib_dataset = calib_dataset.shuffle(seed=42)
@@ -442,7 +454,7 @@ def default_calib_func(model):
                     True if "fullrange" in config.weight_dtype else False
                 ),
                 "enable_mse_search": config.mse_range,
-            }
+            },
         }
         algorithm = "RTN"
     elif config.quant_method.value == "awq":
@@ -470,7 +482,7 @@ def default_calib_func(model):
                 "use_max_length": True if config.max_input_length else False,
                 "pad_max_length": config.max_input_length,
                 "static_groups": config.static_groups,
-            }
+            },
         }
         algorithm = "GPTQ"
     elif config.quant_method.value == "autoround":
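The reworked calibration branch in this diff only builds a dataloader when a calibration dataset is actually configured and the chosen algorithm needs one. A minimal sketch of that dataset-to-dataloader pattern outside the library; the dataset name, tokenizer, text column, and iteration count are assumptions for illustration, not values the library hard-codes:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

calib_dataset_name = "NeelNanda/pile-10k"                         # assumed calibration corpus
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")  # assumed tokenizer
calib_iters = 8                                                   # stand-in for config.calib_iters

dataset = load_dataset(calib_dataset_name, split="train").shuffle(seed=42)

def collate(batch):
    # One example per step; assumes the dataset has a "text" column, and
    # truncation keeps calibration cheap.
    return tokenizer(batch[0]["text"], return_tensors="pt", truncation=True, max_length=512)

calib_dataloader = DataLoader(dataset, batch_size=1, collate_fn=collate)

for step, inputs in enumerate(calib_dataloader):
    if step >= calib_iters:
        break
    # A calibration function would run model(**inputs) here to collect statistics.
    print(step, inputs["input_ids"].shape)
```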

intel_extension_for_transformers/transformers/modeling/modeling_auto.py

Lines changed: 14 additions & 5 deletions
@@ -127,6 +127,8 @@ def recover_export_model(model, current_key_name=None):
             model._modules[name].pack(
                 int_weight, scales, zeros, module.bias, g_idx=g_idx
             )
+            if g_idx is not None:
+                model._modules[name].g_idx = g_idx
 
         if len(list(module.children())) > 0:  # pylint: disable=E1101
             _ = recover_export_model(module, current_key_name)
@@ -179,8 +181,13 @@ def convert_model_to_public(model):
             module.qweight.data = module.qweight.t_().contiguous()
             module.scales.data = module.scales.t_().contiguous()
             module.weight_transposed = False
-    elif model.quantization_config.weight_dtype not in \
-        ["fp8_e5m2", "fp8_e4m3", "nf4", "fp4", "int4_fullrange"]:
+    elif model.quantization_config.weight_dtype not in [
+        "fp8_e5m2",
+        "fp8_e4m3",
+        "nf4",
+        "fp4",
+        "int4_fullrange",
+    ]:
         model = recover_export_model(model)
 
 
@@ -368,8 +375,10 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
                 exit(0)
 
             if config.model_type in cls.model_type_list and not use_xpu:
-                if isinstance(quantization_config,
-                              GPTQConfig) and config.model_type not in cls.model_type_list_for_gptq:
+                if (
+                    isinstance(quantization_config, GPTQConfig)
+                    and config.model_type not in cls.model_type_list_for_gptq
+                ):
                     use_neural_speed = False
                 else:
                     use_neural_speed = True
@@ -609,7 +618,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
             model = convert_to_quantized_model(
                 model, quantization_config, device=device_map
            )
-            quantization_config.tokenizer = None
+            quantization_config.remove_redundant_parameters()
            model.config.quantization_config = quantization_config
 
             # add quantization_config and save_low_bit to pretrained model dynamically
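Replacing `quantization_config.tokenizer = None` with `quantization_config.remove_redundant_parameters()` is what the commit title refers to: the saved WOQ config should not carry calibration-time objects. The diff does not show the method body, so here is a plausible sketch of such a helper; the class and its attribute list are assumptions for illustration, not the library's actual implementation:

```python
class WOQConfigSketch:
    """Illustrative stand-in for a weight-only quantization config; the real
    class and its attribute names live in ITREX and may differ."""

    def __init__(self, bits=4, tokenizer=None, calib_func=None,
                 calib_dataloader=None, calib_iters=100, dataset=None):
        self.bits = bits                      # needed again at load time: keep
        self.tokenizer = tokenizer            # calibration-only: safe to drop
        self.calib_func = calib_func
        self.calib_dataloader = calib_dataloader
        self.calib_iters = calib_iters
        self.dataset = dataset

    def remove_redundant_parameters(self):
        # Null out fields that were only needed while quantizing, so the config
        # attached to model.config (and later serialized with the model) does
        # not drag along tokenizers, dataloaders, or callables.
        for attr in ("tokenizer", "calib_func", "calib_dataloader",
                     "calib_iters", "dataset"):
            setattr(self, attr, None)
```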
