Update usage doc

Elfsong · Elfsong · commit a90b5d6021d9 · 2024-05-28T16:26:59.000Z
diff --git a/docs/README.md b/docs/README.md
@@ -404,6 +404,28 @@ accelerate launch main.py \
   --allow_code_execution
 ```
 
+## Mercury
+[Mercury](https://huggingface.co/datasets/Elfsong/Mercury) is a benchmark for the computational efficiency of Code LLMs. It comprises 1,889 Python programming tasks stratified into three difficulty levels and divided into two datasets, one for model evaluation and one for fine-tuning. For each evaluation task, we assign a test case generator to remedy the shortfall in test case coverage. More details can be found in the [paper](https://arxiv.org/abs/2402.07844).
+
+```shell
+# Install these libraries before running Mercury
+pip install lctk sortedcontainers
+```
+
+```shell
+accelerate launch main.py \
+    --model <MODEL_NAME> \
+    --load_in_4bit \
+    --max_length_generation 2048 \
+    --tasks mercury \
+    --n_samples 5 \
+    --temperature 0.2 \
+    --batch_size 5 \
+    --allow_code_execution \
+    --save_generations \
+    --metric_output_path <MODEL_NAME>.json
+```
+
 ## Code generation benchmarks without unit tests
 
 For these tasks, we do single generations and compare the generated code against reference solutions and compute BLEU score. For the following tasks, we use a two-shot setting where we include 2 inputs and their solutions in the prompt, all preceded by an instruction such as: ` "Answer the following instructions in a one line SQL query:\n"`. The solutions consist of one line so we stop the generation when a new line is generated. 3 languages are present: Python, SQL and Java.