1. [Fine-tuning](#fine-tuning)
   - [Step by step installation with conda](#step-by-step-installation-with-conda)
   - [Datasets](#datasets)
-     - [Code Alpaca](#code-alpaca-ca)
     - [Stack Exchange](#stack-exchange-se)
   - [Merging PEFT adapter layers](#merging-peft-adapter-layers)
-   - [Example outputs](#example-outputs)
# Fine-tuning

Here, we showcase how we can fine-tune this LM on a specific downstream task.
@@ -67,45 +65,7 @@ Now that everything is done, you can clone the repository and get into the corre |
## Datasets
💫 StarCoder can be fine-tuned to achieve multiple downstream tasks. Our interest here is to fine-tune StarCoder so that it follows instructions. [Instruction fine-tuning](https://arxiv.org/pdf/2109.01652.pdf) has recently gained a lot of attention, as it provides a simple framework for teaching language models to align their outputs with human needs. The procedure requires high-quality instruction datasets, which contain many `instruction - answer` pairs. Unfortunately, such datasets are not ubiquitous, but thanks to Hugging Face 🤗's [datasets](https://github.com/huggingface/datasets) library we have access to some good proxies. To fine-tune cheaply and efficiently, we use Hugging Face 🤗's [PEFT](https://github.com/huggingface/peft) as well as Tim Dettmers' [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).

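The fine-tuning commands below rely on that combination: the base model is loaded with 8-bit weights via bitsandbytes and only a small set of LoRA adapter weights is trained on top of it. Here is a minimal, self-contained sketch of the idea; it is not the script's actual configuration, and the checkpoint name, target module names, and hyperparameters are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the base model with 8-bit weights (requires bitsandbytes to be installed).
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/large-model",   # placeholder checkpoint name, as in the commands below
    load_in_8bit=True,
    device_map="auto",
)
base_model = prepare_model_for_int8_training(base_model)

# Attach small trainable LoRA matrices; module names depend on the architecture
# and are an assumption here, not the script's exact choice.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_proj", "c_attn", "q_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of the weights is trainable
```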
-### Code Alpaca CA
-[Code Alpaca](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) is a dataset of about 20K `prompt - completion` pairs generated with the technique presented in the [self-instruct](https://arxiv.org/abs/2212.10560) paper. Each prompt describes a task asked by a user, and the corresponding completion is the answer to that task as generated by `text-davinci-003`.

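To take a quick look at the data before launching a run, the dataset can be loaded with the `datasets` library. A small sketch: the `prompt`/`completion` field names mirror the column names passed to the fine-tuning script below, and the `train` split name is an assumption.

```python
from datasets import load_dataset

# Download the ~20K examples and inspect one prompt/completion pair.
ds = load_dataset("HuggingFaceH4/CodeAlpaca_20K", split="train")
print(ds)
print(ds[0]["prompt"])
print(ds[0]["completion"])
```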
-To execute the fine-tuning script, run the following command:
-```bash
-python finetune/finetune.py \
-  --model_path="bigcode/large-model" \
-  --dataset_name="HuggingFaceH4/CodeAlpaca_20K" \
-  --seq_length 2048 \
-  --max_steps 2000 \
-  --batch_size 1 \
-  --input_column_name="prompt" \
-  --output_column_name="completion" \
-  --gradient_accumulation_steps 16 \
-  --learning_rate 5e-6 \
-  --lr_scheduler_type="linear" \
-  --num_warmup_steps 100 \
-  --weight_decay 0.05 \
-  --output_dir="./checkpoints"
-```
-The size of the model makes fine-tuning intractable in an environment without GPUs, and the problem remains even with PEFT. To launch the training on multiple GPUs, use the following command (we simply prepend `python -m torch.distributed.launch --nproc_per_node number_of_gpus`):
-
-```bash
-python -m torch.distributed.launch \
-  --nproc_per_node number_of_gpus finetune/finetune.py \
-  --model_path="bigcode/large-model" \
-  --dataset_name="HuggingFaceH4/CodeAlpaca_20K" \
-  --seq_length 2048 \
-  --max_steps 2000 \
-  --batch_size 1 \
-  --input_column_name="prompt" \
-  --output_column_name="completion" \
-  --gradient_accumulation_steps 16 \
-  --learning_rate 5e-6 \
-  --lr_scheduler_type="linear" \
-  --num_warmup_steps 100 \
-  --weight_decay 0.05 \
-  --output_dir="./checkpoints"
-```
### Stack Exchange SE
[Stack Exchange](https://en.wikipedia.org/wiki/Stack_Exchange) is a well-known network of Q&A websites covering topics in diverse fields. It is a place where a user can ask a question and obtain answers from other users. Those answers are scored and ranked based on their quality. [Stack exchange instruction](https://huggingface.co/datasets/ArmelR/stack-exchange-instruction) is a dataset that was obtained by scraping the site in order to build a collection of Q&A pairs. A language model can then be fine-tuned on that dataset so that it develops strong and diverse question-answering skills.

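As a sketch of how this dataset could be plugged into the same fine-tuning script, mirroring the command shown above: the hyperparameters are copied from that command and the column names are assumptions, so check the dataset card and adjust before running.

```bash
# Sketch only: column names below are assumptions; check the dataset card.
python finetune/finetune.py \
  --model_path="bigcode/large-model" \
  --dataset_name="ArmelR/stack-exchange-instruction" \
  --seq_length 2048 \
  --max_steps 2000 \
  --batch_size 1 \
  --input_column_name="question" \
  --output_column_name="response" \
  --gradient_accumulation_steps 16 \
  --learning_rate 5e-6 \
  --lr_scheduler_type="linear" \
  --num_warmup_steps 100 \
  --weight_decay 0.05 \
  --output_dir="./checkpoints"
```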
@@ -167,65 +127,19 @@ For example |
python finetune/merge_peft_adapters.py --model_name_or_path bigcode/large-model --peft_model_path checkpoints/checkpoint-1000 --push_to_hub
```

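Conceptually, merging loads the frozen base model, attaches the trained PEFT adapter, and folds the adapter weights back into the base weights before saving. A rough sketch of that logic follows; this is not the `merge_peft_adapters.py` script itself, and the paths and names are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the frozen base model and attach the trained adapter checkpoint.
base = AutoModelForCausalLM.from_pretrained("bigcode/large-model")
peft_model = PeftModel.from_pretrained(base, "checkpoints/checkpoint-1000")

# Fold the LoRA weights into the base weights and drop the adapter wrappers.
merged = peft_model.merge_and_unload()
merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained("bigcode/large-model").save_pretrained("merged-model")
```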
-## Example outputs
-
-These are some example outputs from StarCoder fine-tuned on Code Alpaca.
-
-**Instruction**: Write a function to compute the geometric mean of a list of numbers.
-
-**StarCoder's answer**:
-
-```python
-def geometric_mean(nums):
-    product = 1
-    for num in nums:
-        product *= num
-    return product**(1/len(nums))
-```
-
-**Instruction**:
-What is the purpose of the following code?
+## How to do text-generation with StarCoder

```python
-def function(n):
-    s = ""
-    while n > 0:
-        s = str(n%3) + s
-        n = n//3
-    return s
-```
+from transformers import AutoModelForCausalLM, AutoTokenizer

-**StarCoder's answer**:
-```
-The purpose of the code is to convert a given number n to a string of its base 3 representation. This is done by dividing the number by 3 and converting the remainder to a string and then adding it to the string of the previous iteration. This process is repeated until the number is 0.
-```
-
-**Instruction**:
-Translate this function into Java:
+checkpoint = "bigcode/large-model"
+device = "cuda"  # for GPU usage or "cpu" for CPU usage

-```python
-def function(a, b, c):
-    if a > b:
-        print("You lost.")
-    else:
-        if c > a:
-            print("Nice try.")
-        else:
-            print("You won.")
-```
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

-**StarCoder's answer**:
-
-```java
-public static void function(int a, int b, int c) {
-    if (a > b) {
-        System.out.println("You lost.");
-    } else {
-        if (c > a) {
-            System.out.println("Nice try.");
-        } else {
-            System.out.println("You won.");
-        }
-    }
-}
+inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
+outputs = model.generate(inputs)
+print(tokenizer.decode(outputs[0]))
```
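Note that `model.generate(inputs)` above uses the default generation settings, which only produce a short continuation. For longer or sampled completions you would typically pass extra arguments; the values below are illustrative only.

```python
# Illustrative settings; tune max_new_tokens and the sampling parameters to your use case.
outputs = model.generate(inputs, max_new_tokens=64, do_sample=True, temperature=0.2, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```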
+## Text-inference