1. [Fine-tuning](#fine-tuning)
   - [Step by step installation with conda](#step-by-step-installation-with-conda)
   - [Datasets](#datasets)
-     - [Code Alpaca](#code-alpaca-ca)
     - [Stack Exchange](#stack-exchange-se)
   - [Merging PEFT adapter layers](#merging-peft-adapter-layers)
-   - [Example outputs](#example-outputs)
# Fine-tuning

Here, we showcase how we can fine-tune this LM on a specific downstream task.
@@ -67,45 +65,7 @@ Now that everything is done, you can clone the repository and get into the corre |
## Datasets
💫 StarCoder can be fine-tuned to achieve multiple downstream tasks. Our interest here is to fine-tune StarCoder so that it follows instructions. [Instruction fine-tuning](https://arxiv.org/pdf/2109.01652.pdf) has recently gained a lot of attention, as it provides a simple framework for teaching language models to align their outputs with human needs. The procedure requires high-quality instruction datasets, which contain many `instruction - answer` pairs. Unfortunately, such datasets are not ubiquitous, but thanks to Hugging Face 🤗's [datasets](https://github.com/huggingface/datasets) library we have access to some good proxies. To fine-tune cheaply and efficiently, we use Hugging Face 🤗's [PEFT](https://github.com/huggingface/peft) as well as Tim Dettmers' [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

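The fine-tuning commands below rely on that combination: the base model is loaded with 8-bit weights via bitsandbytes and only a small set of LoRA adapter weights is trained on top of it. Here is a minimal, self-contained sketch of the idea; it is not the script's actual configuration, and the checkpoint name, target module names, and hyperparameters are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the base model with 8-bit weights (requires bitsandbytes to be installed).
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/large-model",   # placeholder checkpoint name, as in the commands below
    load_in_8bit=True,
    device_map="auto",
)
base_model = prepare_model_for_int8_training(base_model)

# Attach small trainable LoRA matrices; module names depend on the architecture
# and are an assumption here, not the script's exact choice.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_proj", "c_attn", "q_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable
```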
-### Code Alpaca CA
-[Code Alpaca](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) is a dataset of about 20K `prompt - completion` pairs generated with the technique presented in the [self-instruct](https://arxiv.org/abs/2212.10560) paper. Each prompt describes a task asked by a user, and the corresponding completion is the answer to that task as generated by `text-davinci-003`.

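To take a quick look at the data before launching a run, the dataset can be loaded with the `datasets` library. A small sketch: the `prompt`/`completion` field names mirror the column names passed to the fine-tuning script below, and the `train` split name is an assumption.

```python
from datasets import load_dataset

# Download the ~20K examples and inspect one prompt/completion pair.
ds = load_dataset("HuggingFaceH4/CodeAlpaca_20K", split="train")
print(ds)
print(ds[0]["prompt"])
print(ds[0]["completion"])
```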
-To execute the fine-tuning script, run the following command:
-```bash
-python finetune/finetune.py \
-  --model_path="bigcode/large-model" \
-  --dataset_name="HuggingFaceH4/CodeAlpaca_20K" \
-  --seq_length 2048 \
-  --max_steps 2000 \
-  --batch_size 1 \
-  --input_column_name="prompt" \
-  --output_column_name="completion" \
-  --gradient_accumulation_steps 16 \
-  --learning_rate 5e-6 \
-  --lr_scheduler_type="linear" \
-  --num_warmup_steps 100 \
-  --weight_decay 0.05 \
-  --output_dir="./checkpoints"
-```
-The size of the model makes fine-tuning intractable in an environment without GPUs, and the problem remains even with PEFT. To launch the training on multiple GPUs, use the following command (we simply prepend `python -m torch.distributed.launch --nproc_per_node number_of_gpus`):
-
-```bash
-python -m torch.distributed.launch \
-  --nproc_per_node number_of_gpus finetune/finetune.py \
-  --model_path="bigcode/large-model" \
-  --dataset_name="HuggingFaceH4/CodeAlpaca_20K" \
-  --seq_length 2048 \
-  --max_steps 2000 \
-  --batch_size 1 \
-  --input_column_name="prompt" \
-  --output_column_name="completion" \
-  --gradient_accumulation_steps 16 \
-  --learning_rate 5e-6 \
-  --lr_scheduler_type="linear" \
-  --num_warmup_steps 100 \
-  --weight_decay 0.05 \
-  --output_dir="./checkpoints"
-```
### Stack Exchange SE
[Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange) is a well-known network of Q&A websites covering topics in diverse fields. It is a place where a user can ask a question and obtain answers from other users. Those answers are scored and ranked based on their quality. [Stack exchange instruction](https://huggingface.co/datasets/ArmelR/stack-exchange-instruction) is a dataset that was obtained by scraping the site in order to build a collection of Q&A pairs. A language model can then be fine-tuned on that dataset so that it develops strong and diverse question-answering skills.

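As a sketch of how this dataset could be plugged into the same fine-tuning script, mirroring the command shown above: the hyperparameters are copied from that command and the column names are assumptions, so check the dataset card and adjust before running.

```bash
# Sketch only: column names below are assumptions; check the dataset card.
python finetune/finetune.py \
  --model_path="bigcode/large-model" \
  --dataset_name="ArmelR/stack-exchange-instruction" \
  --seq_length 2048 \
  --max_steps 2000 \
  --batch_size 1 \
  --input_column_name="question" \
  --output_column_name="response" \
  --gradient_accumulation_steps 16 \
  --learning_rate 5e-6 \
  --lr_scheduler_type="linear" \
  --num_warmup_steps 100 \
  --weight_decay 0.05 \
  --output_dir="./checkpoints"
```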
@@ -167,65 +127,19 @@ For example |
python finetune/merge_peft_adapters.py --model_name_or_path bigcode/large-model --peft_model_path checkpoints/checkpoint-1000 --push_to_hub
```

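Conceptually, merging loads the frozen base model, attaches the trained PEFT adapter, and folds the adapter weights back into the base weights before saving. A rough sketch of that logic follows; this is not the `merge_peft_adapters.py` script itself, and the paths and names are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the frozen base model and attach the trained adapter checkpoint.
base = AutoModelForCausalLM.from_pretrained("bigcode/large-model")
peft_model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-1000")

# Fold the LoRA weights into the base weights and drop the adapter wrappers.
merged = peft_model.merge_and_unload()
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("bigcode/large-model").save_pretrained("merged-model")
```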
-## Example outputs
-
-These are some example outputs from StarCoder fine-tuned on Code Alpaca.
-
-**Instruction**: Write a function to compute the geometric mean of a list of numbers.
-
-**StarCoder's answer**:
-
-```python
-def geometric_mean(nums):
-    product = 1
-    for num in nums:
-        product *= num
-    return product**(1/len(nums))
-```
-
-**Instruction**:
-What is the purpose of the following code?
+## How to do text-generation with StarCoder

```python
-def function(n):
-    s = ""
-    while n > 0:
-        s = str(n%3) + s
-        n = n//3
-    return s
-```
+from transformers import AutoModelForCausalLM, AutoTokenizer

-**StarCoder's answer**:
-```
-The purpose of the code is to convert a given number n to a string of its base 3 representation. This is done by dividing the number by 3 and converting the remainder to a string and then adding it to the string of the previous iteration. This process is repeated until the number is 0.
-```
-
-**Instruction**:
-Translate this function into Java:
+checkpoint = "bigcode/large-model"
+device = "cuda"  # for GPU usage or "cpu" for CPU usage

-```python
-def function(a, b, c):
-    if a > b:
-        print("You lost.")
-    else:
-        if c > a:
-            print("Nice try.")
-        else:
-            print("You won.")
-```
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

-**StarCoder's answer**:
-
-```java
-public static void function(int a, int b, int c) {
-    if (a > b) {
-        System.out.println("You lost.");
-    } else {
-        if (c > a) {
-            System.out.println("Nice try.");
-        } else {
-            System.out.println("You won.");
-        }
-    }
-}
+inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+outputs = model.generate(inputs)
+print(tokenizer.decode(outputs[0]))
```
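Note that `model.generate(inputs)` above uses the default generation settings, which only produce a short continuation. For longer or sampled completions you would typically pass extra arguments; the values below are illustrative only.

```python
# Illustrative settings; tune max_new_tokens and the sampling parameters to your use case.
outputs = model.generate(inputs, max_new_tokens=64, do_sample=True, temperature=0.2, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```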
+## Text-inference