This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit 5e35e47

Authored by mini-goel, pre-commit-ci[bot], and VincyZhang
Add text-gen finetune workflow for glue mnli (#1478)
* Add text-gen finetune workflow for glue mnli

Signed-off-by: mini-goel <104451502+mini-goel@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: VincyZhang <wenxin.zhang@intel.com>
1 parent 4fe1913 commit 5e35e47

File tree

8 files changed, +337 −11 lines changed

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# GPT-J fine-tuning and inference

1. [Introduction](#introduction)
2. [Get Started](#get-started)

# Introduction

GPT-J 6B is an open-source large language model (LLM) with 6B parameters. Like GPT-3, it is an autoregressive, decoder-only transformer model designed to solve natural language processing (NLP) tasks by predicting how a piece of text will continue from a prompt. It was pre-trained on the Pile dataset using the Mesh Transformer JAX library to handle the parallelization scheme, with a vocabulary of 50257 tokens.
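To make "predicting how a piece of text will continue" concrete, here is a minimal generation sketch using the Hugging Face transformers API. It is not part of this workflow and assumes enough memory to load the 6B checkpoint:

```
# Minimal sketch: autoregressive generation with GPT-J (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("The Pile is a large, diverse dataset", return_tensors="pt")
# The decoder-only model predicts the continuation one token at a time.
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```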
This example demonstrates an end-to-end LLM fine-tuning workflow using the [GLUE MNLI](https://huggingface.co/datasets/glue/viewer/mnli/train) dataset and the [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) technique to reduce fine-tuning time.

This workflow runs on Intel Xeon CPU platforms and has been verified on 4th Gen Intel Xeon CPUs.
### Dataset

The MNLI dataset consists of sentence pairs: a premise and a hypothesis. The task is to predict the relation between the premise and the hypothesis, which can be one of:

- Entailment: the premise entails the hypothesis,
- Contradiction: the hypothesis contradicts the premise, and
- Neutral: the hypothesis and premise are unrelated.

The prompt given to the model combines the premise and the hypothesis, and the fine-tuned model generates/predicts the relation as the next token.
### Fine-tuning and Inference

The GPT-J 6B model is fine-tuned for the NLI (Natural Language Inference) task on the GLUE MNLI dataset. The LoRA (PEFT) technique is applied, which represents the updates to the LLM's weight matrices as products of smaller, lower-rank matrices called LoRA adapters. This significantly reduces the number of trainable parameters, and thus the fine-tuning time, without hurting model quality.
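As a rough sketch of how LoRA adapters are attached with the peft library (in this workflow the wiring happens inside finetune_clm.py via the `--peft lora` flag; the rank value below is an assumed illustration):

```
# Sketch only: attaching LoRA adapters to GPT-J with peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
lora_config = LoraConfig(
    r=8,  # rank of the low-rank update matrices (assumed value)
    lora_alpha=54,  # scaling factor, matching the --lora_alpha flag below
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Only the small adapter matrices train; the 6B base weights stay frozen.
model.print_trainable_parameters()
```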
The fine-tuned model is then evaluated on the `validation` split using the Hugging Face pipeline API, achieving SOTA accuracy.
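A hedged sketch of what that evaluation can look like with the pipeline API, assuming the fine-tuned weights were merged and saved to `./gptj_peft_finetuned_model` and that the prompt shape matches the template used in training:

```
# Sketch: scoring one validation example with the pipeline API.
from datasets import load_dataset
from transformers import pipeline

generator = pipeline("text-generation", model="./gptj_peft_finetuned_model")
validation = load_dataset("glue", "mnli", split="validation_matched")

sample = validation[0]
# Illustrative prompt; the workflow's exact template lives in prompt.py.
prompt = f"premise: {sample['premise']}\nhypothesis: {sample['hypothesis']}\nlabel:"
print(generator(prompt, max_new_tokens=1)[0]["generated_text"])
```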
# Get Started

### 1. Download the Workflow Repository

Clone the repository and change into the fine-tuning example directory:
```
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/text_generation
```
### 2. Create environment and install software packages

Create a miniconda environment and install the required packages:
```
conda create -n gptj_ft_env python=3.10
conda activate gptj_ft_env
pip install -r requirements.txt
```

Install the following for best performance:
```
conda install mkl mkl-include -y
conda install gperftools jemalloc==5.2.1 -c conda-forge -y
```
### 3. Prepare dataset

We use the [GLUE MNLI](https://huggingface.co/datasets/glue/viewer/mnli/train) dataset from Hugging Face.
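For a quick look at the raw records this workflow consumes (a sketch using the datasets library; label ids follow the Hugging Face glue/mnli mapping):

```
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
print(mnli["train"][0])
# {'premise': '...', 'hypothesis': '...', 'label': 1, 'idx': 0}
# Label ids: 0 = entailment, 1 = neutral, 2 = contradiction.
```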
## Fine-tuning

### Single-node fine-tuning

Set the following environment variables:
```
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
```

Set libiomp:
```
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
```

Tcmalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support:
```
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
```
Run the command below, or use singlenode_runscript.sh.

Distributed Data Parallel (DDP) fine-tuning is also supported in single-node and multi-node settings. The following command runs single-node fine-tuning:

```
python finetune_clm.py \
--model_name_or_path "EleutherAI/gpt-j-6B" \
--bf16 True \
--dataset_name "glue" \
--dataset_config_name "mnli" \
--dataset_concatenation \
--config_name ./config.json \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--learning_rate 3.3113761e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./gptj_peft_finetuned_model \
--peft lora \
--lora_alpha 54 \
--lora_target_modules q_proj v_proj k_proj out_proj \
--use_fast_tokenizer false \
--use_cpu \
--task completion \
--max_train_samples 5000 \
--max_eval_samples 500
```
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPTJForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gptj",
  "n_embd": 4096,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 28,
  "n_positions": 2048,
  "resid_pdrop": 0.0,
  "rotary": true,
  "rotary_dim": 64,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 250,
      "temperature": 0.2,
      "top_p": 0.4,
      "top_k": 70
    }
  },
  "tie_word_embeddings": false,
  "tokenizer_class": "GPT2Tokenizer",
  "transformers_version": "4.18.0.dev0",
  "use_cache": true,
  "vocab_size": 50400
}
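The file above is a standard GPT-J configuration (passed to the run via `--config_name ./config.json`). As a quick sanity check, it can be loaded with transformers — a sketch, assuming the file is saved as config.json in the working directory:

```
from transformers import GPTJConfig

config = GPTJConfig.from_json_file("config.json")
# 28 transformer layers, 4096-dim embeddings, 50400-entry vocabulary.
print(config.n_layer, config.n_embd, config.vocab_size)
```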
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
from transformers import TrainingArguments, HfArgumentParser
from intel_extension_for_transformers.neural_chat.config import (
    ModelArguments,
    DataArguments,
    FinetuningArguments,
    TextGenerationFinetuningConfig,
)
from intel_extension_for_transformers.neural_chat.chatbot import finetune_model
from intel_extension_for_transformers.neural_chat.utils.common import is_hpu_available


def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.
    if not is_hpu_available:
        parser = HfArgumentParser(
            (ModelArguments, DataArguments, TrainingArguments, FinetuningArguments)
        )
    else:
        from optimum.habana import GaudiTrainingArguments

        parser = HfArgumentParser(
            (ModelArguments, DataArguments, GaudiTrainingArguments, FinetuningArguments)
        )

    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args, finetune_args = parser.parse_json_file(
            json_file=os.path.abspath(sys.argv[1])
        )
    else:
        (
            model_args,
            data_args,
            training_args,
            finetune_args,
        ) = parser.parse_args_into_dataclasses()

    finetune_cfg = TextGenerationFinetuningConfig(
        model_args=model_args,
        data_args=data_args,
        training_args=training_args,
        finetune_args=finetune_args,
    )
    finetune_model(finetune_cfg)


if __name__ == "__main__":
    main()
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
datasets
einops
evaluate
fastapi
nltk
peft
pydub
python-multipart
rouge_score
sentencepiece
shortuuid
torch==2.2.0
transformers
uvicorn
yacs
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so

python finetune_clm.py \
--model_name_or_path "EleutherAI/gpt-j-6B" \
--bf16 True \
--dataset_name "glue" \
--dataset_config_name "mnli" \
--dataset_concatenation \
--config_name ./config.json \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--learning_rate 3.3113761e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./gptj_peft_finetuned_model \
--peft lora \
--lora_alpha 54 \
--lora_target_modules q_proj v_proj k_proj out_proj \
--use_fast_tokenizer false \
--use_cpu \
--task completion \
--max_train_samples 5000 \
--max_eval_samples 500

intel_extension_for_transformers/neural_chat/prompts/prompt.py

Lines changed: 13 additions & 0 deletions
@@ -114,6 +114,19 @@
     )
 )

+# Glue mnli
+register_conv_template(
+    Conversation(
+        name="glue_mnli",
+        system_message="Glue mnli.",
+        roles=("premise", "hypothesis", "label"),
+        messages=(),
+        offset=0,
+        sep_style=SeparatorStyle.ROBIN,
+        sep="\n",
+    )
+)
+
 # Summarization template
 register_conv_template(
     Conversation(
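Assuming SeparatorStyle.ROBIN renders each turn as `role: message` followed by the separator (an assumption based on similar conversation templates; verify against the Conversation implementation in prompt.py), the registered glue_mnli template would yield prompts shaped roughly like this:

```
# Hypothetical rendering of the glue_mnli template (format assumed).
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

prompt = (
    "Glue mnli.\n"                 # system_message
    f"premise: {premise}\n"        # role 0
    f"hypothesis: {hypothesis}\n"  # role 1
    "label:"                       # role 2, left empty for the model to fill
)
print(prompt)
```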

intel_extension_for_transformers/transformers/llm/finetuning/data_utils.py

Lines changed: 5 additions & 2 deletions
@@ -63,6 +63,9 @@ def __init__(self, dataset_name):
         elif "stack-exchange-instruction" in self.dataset_name:
             self.prompt_template = PromptTemplate("question_answer")
             self.key_role_map = [('question', 0), ('response', 1)]
+        elif "glue" in self.dataset_name:
+            self.prompt_template = PromptTemplate("glue_mnli")
+            self.key_role_map = [('premise', 0), ('hypothesis', 1), ('label', 2)]
         else:
             raise NotImplementedError(
                 f"Unsupported dataset {dataset_name}, "
@@ -86,13 +89,13 @@ def create_data(self, examples):
             key_role_map = self.key_role_map[0]

         for idx, (key, role) in enumerate(key_role_map):
-            message = example[key]
+            message = str(example[key])
             if idx == len(key_role_map)-1:
                 message = ""
             prompt_template.append_message(prompt_template.roles[role], message)
         source = prompt_template.get_prompt()
         prompts["source"].append(source)
-        prompts["target"].append(example[key_role_map[-1][0]])
+        prompts["target"].append(str(example[key_role_map[-1][0]]))
         prompt_template.clear_messages()
     return prompts
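Putting the two changes together: create_data builds a source prompt whose final role is left empty for the model to complete, and a stringified target. For one MNLI record the result would look roughly like the following (illustrative values, assuming the glue_mnli template renders as sketched above):

```
# Hypothetical example record and the resulting prompt pair.
example = {"premise": "A man inspects the uniform.",
           "hypothesis": "The man is sleeping.",
           "label": 2}

# The last key ('label') is appended with an empty message, so the model
# learns to generate it; str(...) is needed because MNLI labels are ints.
source = ("Glue mnli.\n"
          "premise: A man inspects the uniform.\n"
          "hypothesis: The man is sleeping.\n"
          "label:")
target = str(example["label"])  # "2"
```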