This repository was archived by the owner on Oct 25, 2024. It is now read-only.

Commit 5e35e47

Authored by mini-goel, pre-commit-ci[bot], and VincyZhang
Add text-gen finetune workflow for glue mnli (#1478)
* Add text-gen finetune workflow for glue mnli

Signed-off-by: mini-goel <104451502+mini-goel@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: VincyZhang <wenxin.zhang@intel.com>
1 parent 4fe1913 commit 5e35e47

File tree

8 files changed, +337 −11 lines changed

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# GPT-J fine-tuning and inference

1. [Introduction](#introduction)
2. [Get Started](#get-started)

# Introduction

GPT-J 6B is an open-source large language model (LLM) with 6B parameters. Like GPT-3, it is an autoregressive, decoder-only transformer model designed to solve natural language processing (NLP) tasks by predicting how a piece of text will continue from a prompt. It was pre-trained on the Pile dataset using the Mesh Transformer JAX library to handle the parallelization scheme, with a vocabulary of 50257 tokens.
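To make "predicting how a piece of text will continue" concrete, here is a minimal generation sketch using the Hugging Face transformers API. It is not part of this workflow and assumes enough memory to load the 6B checkpoint:

```
# Minimal sketch: autoregressive generation with GPT-J (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

inputs = tokenizer("The Pile is a large, diverse dataset", return_tensors="pt")
# The decoder-only model predicts the continuation one token at a time.
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```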
This example demonstrates an end-to-end LLM fine-tuning workflow using the [GLUE MNLI](https://huggingface.co/datasets/glue/viewer/mnli/train) dataset and the [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) technique to reduce fine-tuning time.

This workflow runs on Intel Xeon CPU platforms and has been verified on 4th Gen Intel Xeon CPUs.
### Dataset

The MNLI dataset consists of sentence pairs: a premise and a hypothesis. The task is to predict the relation between the premise and the hypothesis, which can be one of:

- Entailment: the premise entails the hypothesis,
- Contradiction: the hypothesis contradicts the premise, and
- Neutral: the hypothesis and premise are unrelated.

The prompt given to the model combines the premise and the hypothesis, and the fine-tuned model generates/predicts the relation as the next token.
### Fine-tuning and Inference

The GPT-J 6B model is fine-tuned for the NLI (Natural Language Inference) task on the GLUE MNLI dataset. The LoRA (PEFT) technique is applied, which represents the updates to the LLM's weight matrices as products of smaller, lower-rank matrices called LoRA adapters. This significantly reduces the number of trainable parameters, and thus the fine-tuning time, without hurting model quality.
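As a rough sketch of how LoRA adapters are attached with the peft library (in this workflow the wiring happens inside finetune_clm.py via the `--peft lora` flag; the rank value below is an assumed illustration):

```
# Sketch only: attaching LoRA adapters to GPT-J with peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
lora_config = LoraConfig(
    r=8,  # rank of the low-rank update matrices (assumed value)
    lora_alpha=54,  # scaling factor, matching the --lora_alpha flag below
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Only the small adapter matrices train; the 6B base weights stay frozen.
model.print_trainable_parameters()
```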
The fine-tuned model is then evaluated on the `validation` split using the Hugging Face pipeline API, achieving SOTA accuracy.
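A hedged sketch of what that evaluation can look like with the pipeline API, assuming the fine-tuned weights were merged and saved to `./gptj_peft_finetuned_model` and that the prompt shape matches the template used in training:

```
# Sketch: scoring one validation example with the pipeline API.
from datasets import load_dataset
from transformers import pipeline

generator = pipeline("text-generation", model="./gptj_peft_finetuned_model")
validation = load_dataset("glue", "mnli", split="validation_matched")

sample = validation[0]
# Illustrative prompt; the workflow's exact template lives in prompt.py.
prompt = f"premise: {sample['premise']}\nhypothesis: {sample['hypothesis']}\nlabel:"
print(generator(prompt, max_new_tokens=1)[0]["generated_text"])
```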
# Get Started

### 1. Download the Workflow Repository

Clone the repository and change into the fine-tuning example directory:
```
git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/text_generation
```
### 2. Create environment and install software packages

Create a miniconda environment and install the required packages:
```
conda create -n gptj_ft_env python=3.10
conda activate gptj_ft_env
pip install -r requirements.txt
```

Install the following for best performance:
```
conda install mkl mkl-include -y
conda install gperftools jemalloc==5.2.1 -c conda-forge -y
```
### 3. Prepare dataset

We use the [GLUE MNLI](https://huggingface.co/datasets/glue/viewer/mnli/train) dataset from Hugging Face.
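For a quick look at the raw records this workflow consumes (a sketch using the datasets library; label ids follow the Hugging Face glue/mnli mapping):

```
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")
print(mnli["train"][0])
# {'premise': '...', 'hypothesis': '...', 'label': 1, 'idx': 0}
# Label ids: 0 = entailment, 1 = neutral, 2 = contradiction.
```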
## Fine-tuning

### Single-node fine-tuning

Set the following environment variables:
```
export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
```

Set libiomp:
```
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
```

Tcmalloc is a recommended malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support:
```
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
```
Run the command below, or use singlenode_runscript.sh.

Distributed Data Parallel (DDP) fine-tuning is also supported in single-node and multi-node settings. The following command runs single-node fine-tuning:

```
python finetune_clm.py \
--model_name_or_path "EleutherAI/gpt-j-6B" \
--bf16 True \
--dataset_name "glue" \
--dataset_config_name "mnli" \
--dataset_concatenation \
--config_name ./config.json \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--learning_rate 3.3113761e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./gptj_peft_finetuned_model \
--peft lora \
--lora_alpha 54 \
--lora_target_modules q_proj v_proj k_proj out_proj \
--use_fast_tokenizer false \
--use_cpu \
--task completion \
--max_train_samples 5000 \
--max_eval_samples 500
```
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
{
  "activation_function": "gelu_new",
  "architectures": [
    "GPTJForCausalLM"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gptj",
  "n_embd": 4096,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 28,
  "n_positions": 2048,
  "resid_pdrop": 0.0,
  "rotary": true,
  "rotary_dim": 64,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 250,
      "temperature": 0.2,
      "top_p": 0.4,
      "top_k": 70
    }
  },
  "tie_word_embeddings": false,
  "tokenizer_class": "GPT2Tokenizer",
  "transformers_version": "4.18.0.dev0",
  "use_cache": true,
  "vocab_size": 50400
}
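The file above is a standard GPT-J configuration (passed to the run via `--config_name ./config.json`). As a quick sanity check, it can be loaded with transformers — a sketch, assuming the file is saved as config.json in the working directory:

```
from transformers import GPTJConfig

config = GPTJConfig.from_json_file("config.json")
# 28 transformer layers, 4096-dim embeddings, 50400-entry vocabulary.
print(config.n_layer, config.n_embd, config.vocab_size)
```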
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
from transformers import TrainingArguments, HfArgumentParser
from intel_extension_for_transformers.neural_chat.config import (
    ModelArguments,
    DataArguments,
    FinetuningArguments,
    TextGenerationFinetuningConfig,
)
from intel_extension_for_transformers.neural_chat.chatbot import finetune_model
from intel_extension_for_transformers.neural_chat.utils.common import is_hpu_available


def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.
    if not is_hpu_available:
        parser = HfArgumentParser(
            (ModelArguments, DataArguments, TrainingArguments, FinetuningArguments)
        )
    else:
        from optimum.habana import GaudiTrainingArguments

        parser = HfArgumentParser(
            (ModelArguments, DataArguments, GaudiTrainingArguments, FinetuningArguments)
        )

    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args, finetune_args = parser.parse_json_file(
            json_file=os.path.abspath(sys.argv[1])
        )
    else:
        (
            model_args,
            data_args,
            training_args,
            finetune_args,
        ) = parser.parse_args_into_dataclasses()

    finetune_cfg = TextGenerationFinetuningConfig(
        model_args=model_args,
        data_args=data_args,
        training_args=training_args,
        finetune_args=finetune_args,
    )
    finetune_model(finetune_cfg)


if __name__ == "__main__":
    main()
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
datasets
einops
evaluate
fastapi
nltk
peft
pydub
python-multipart
rouge_score
sentencepiece
shortuuid
torch==2.2.0
transformers
uvicorn
yacs
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export KMP_BLOCKTIME=1
export KMP_SETTINGS=1
export KMP_AFFINITY=granularity=fine,compact,1,0
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so

python finetune_clm.py \
--model_name_or_path "EleutherAI/gpt-j-6B" \
--bf16 True \
--dataset_name "glue" \
--dataset_config_name "mnli" \
--dataset_concatenation \
--config_name ./config.json \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--do_train \
--do_eval \
--learning_rate 3.3113761e-4 \
--num_train_epochs 3 \
--logging_steps 100 \
--save_total_limit 2 \
--overwrite_output_dir \
--log_level info \
--save_strategy epoch \
--output_dir ./gptj_peft_finetuned_model \
--peft lora \
--lora_alpha 54 \
--lora_target_modules q_proj v_proj k_proj out_proj \
--use_fast_tokenizer false \
--use_cpu \
--task completion \
--max_train_samples 5000 \
--max_eval_samples 500

intel_extension_for_transformers/neural_chat/prompts/prompt.py

Lines changed: 13 additions & 0 deletions
@@ -114,6 +114,19 @@
     )
 )

+# Glue mnli
+register_conv_template(
+    Conversation(
+        name="glue_mnli",
+        system_message="Glue mnli.",
+        roles=("premise", "hypothesis", "label"),
+        messages=(),
+        offset=0,
+        sep_style=SeparatorStyle.ROBIN,
+        sep="\n",
+    )
+)
+
 # Summarization template
 register_conv_template(
     Conversation(
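Assuming SeparatorStyle.ROBIN renders each turn as `role: message` followed by the separator (an assumption based on similar conversation templates; verify against the Conversation implementation in prompt.py), the registered glue_mnli template would yield prompts shaped roughly like this:

```
# Hypothetical rendering of the glue_mnli template (format assumed).
premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

prompt = (
    "Glue mnli.\n"                 # system_message
    f"premise: {premise}\n"        # role 0
    f"hypothesis: {hypothesis}\n"  # role 1
    "label:"                       # role 2, left empty for the model to fill
)
print(prompt)
```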

intel_extension_for_transformers/transformers/llm/finetuning/data_utils.py

Lines changed: 5 additions & 2 deletions
@@ -63,6 +63,9 @@ def __init__(self, dataset_name):
         elif "stack-exchange-instruction" in self.dataset_name:
             self.prompt_template = PromptTemplate("question_answer")
             self.key_role_map = [('question', 0), ('response', 1)]
+        elif "glue" in self.dataset_name:
+            self.prompt_template = PromptTemplate("glue_mnli")
+            self.key_role_map = [('premise', 0), ('hypothesis', 1), ('label', 2)]
         else:
             raise NotImplementedError(
                 f"Unsupported dataset {dataset_name}, "
@@ -86,13 +89,13 @@ def create_data(self, examples):
             key_role_map = self.key_role_map[0]

         for idx, (key, role) in enumerate(key_role_map):
-            message = example[key]
+            message = str(example[key])
             if idx == len(key_role_map)-1:
                 message = ""
             prompt_template.append_message(prompt_template.roles[role], message)
         source = prompt_template.get_prompt()
         prompts["source"].append(source)
-        prompts["target"].append(example[key_role_map[-1][0]])
+        prompts["target"].append(str(example[key_role_map[-1][0]]))
         prompt_template.clear_messages()
     return prompts
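Putting the two changes together: create_data builds a source prompt whose final role is left empty for the model to complete, and a stringified target. For one MNLI record the result would look roughly like the following (illustrative values, assuming the glue_mnli template renders as sketched above):

```
# Hypothetical example record and the resulting prompt pair.
example = {"premise": "A man inspects the uniform.",
           "hypothesis": "The man is sleeping.",
           "label": 2}

# The last key ('label') is appended with an empty message, so the model
# learns to generate it; str(...) is needed because MNLI labels are ints.
source = ("Glue mnli.\n"
          "premise: A man inspects the uniform.\n"
          "hypothesis: The man is sleeping.\n"
          "label:")
target = str(example["label"])  # "2"
```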