
Commit 00967d1

Merge pull request #190 from ganler/mbppplus
Add mbpp+ evaluation task
2 parents 9cfa52b + 5d4bc98 · commit 00967d1

File tree

4 files changed: +112 −3 lines

- README.md
- bigcode_eval/tasks/__init__.py
- bigcode_eval/tasks/mbppplus.py
- docs/README.md

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -26,7 +26,7 @@ Below are the features and tasks of this framework:
 - We provide Multi-GPU text generation with `accelerate` and Dockerfiles for evaluating on Docker containers for security and reproducibility.
 
 - Tasks:
-    - 6 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp) and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode.
+    - 7 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode.
     - [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) extends HumanEval to **3** scenarios across **6** languages via human translations and was released with [OctoPack](https://arxiv.org/abs/2308.07124).
     - [MultiPL-E](https://github.com/nuprl/MultiPL-E) evaluation suite (HumanEval translated into **18** programming languages).
    - [Recode](https://github.com/amazon-science/recode/tree/main) applied to the HumanEval benchmark. It evaluates the robustness of code-generation models.
```

bigcode_eval/tasks/__init__.py

Lines changed: 3 additions & 2 deletions
```diff
@@ -3,8 +3,8 @@
 
 from . import (apps, codexglue_code_to_text, codexglue_text_to_text, conala,
                concode, ds1000, gsm, humaneval, humanevalplus, humanevalpack,
-               instruct_humaneval, instruct_wizard_humaneval, mbpp, multiple,
-               parity, python_bugs, quixbugs, recode, santacoder_fim)
+               instruct_humaneval, instruct_wizard_humaneval, mbpp, mbppplus,
+               multiple, parity, python_bugs, quixbugs, recode, santacoder_fim)
 
 TASK_REGISTRY = {
     **apps.create_all_tasks(),
@@ -19,6 +19,7 @@
     **humanevalplus.create_all_tasks(),
     **humanevalpack.create_all_tasks(),
     "mbpp": mbpp.MBPP,
+    "mbppplus": mbppplus.MBPPPlus,
     "parity": parity.Parity,
     "python_bugs": python_bugs.PythonBugs,
     "quixbugs": quixbugs.QuixBugs,
```

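For orientation, here is a minimal sketch (not part of this commit) of how the new registry entry could be resolved at runtime. It assumes only what the diff shows, plus that `MBPPPlus` can be instantiated without arguments like its `MBPP` parent and that instantiation loads the dataset from the Hugging Face Hub:

```python
# Minimal sketch: look up the newly registered task by name.
from bigcode_eval.tasks import TASK_REGISTRY  # registry edited in the diff above

task_cls = TASK_REGISTRY["mbppplus"]  # -> mbppplus.MBPPPlus
task = task_cls()                     # assumed no-argument constructor, as with MBPP
print(task.DATASET_PATH)              # "evalplus/mbppplus"
```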
bigcode_eval/tasks/mbppplus.py

Lines changed: 77 additions & 0 deletions
```python
"""Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
https://openreview.net/forum?id=1qvx610Cu7

The MBPP+ dataset is created by the EvalPlus framework, which extends the original MBPP dataset
by adding more automatically generated test cases to each problem. Note that MBPP+ only includes 399
tasks, which are a subset of the original MBPP dataset. The subset is selected from the sanitized
MBPP (a subset of tasks manually examined by the original MBPP authors), and EvalPlus further
removes low-quality and ill-formed tasks for benchmark quality control.

Homepage: https://github.com/evalplus/evalplus
"""

import os

from bigcode_eval.tasks.mbpp import MBPP
from bigcode_eval.tasks.custom_metrics.code_eval import compute_code_eval

_CITATION = """
@inproceedings{evalplus,
  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year = {2023},
  url = {https://openreview.net/forum?id=1qvx610Cu7},
}
"""


class MBPPPlus(MBPP):
    """A task represents an entire benchmark including its dataset, problems,
    answers, generation settings and evaluation methods.
    """

    DATASET_PATH = "evalplus/mbppplus"

    def get_prompt(self, doc):
        """Builds the prompt for the LM to generate from.
        The MBPP prompt is built following the InCoder (Fried et al.) approach:
        prompt = docstring that includes one test.
        """
        description = doc["prompt"]  # the sanitized test set uses "prompt" instead of "text"
        test_example = doc["test_list"][0]
        prompt = f'"""\n{description}\n{test_example}\n"""\n'
        return prompt

    # NOTE(@ganler): MBPP+ extends the original MBPP jsonl data with a "test" field which
    #                includes the testing code ready for execution. Note the "test" field
    #                is different from HumanEval(+), which further requires a `check` func.
    def get_reference(self, doc):
        """Builds the reference solution for the doc (sample from the test dataset)."""
        use_mbpp_tests = os.getenv("MBBPPLUS_USE_MBPP_TESTS", "0")
        if use_mbpp_tests == "1":
            return "\n".join(doc["test_list"])
        return "\n" + doc["test"]

    def get_dataset(self):
        """Returns the dataset for the task, or an iterable of any object that get_prompt can handle."""
        dataset = self.dataset["test"]
        assert (
            len(dataset) == 399
        ), "MBPP+ only has 399 problems. Please retry by deleting its old cache"
        return dataset

    def process_results(self, generations, references):
        """Takes the list of LM generations and evaluates them against ground-truth references,
        returning the metric for the generations.
        :param generations: list(list(str))
            list of lists containing generations
        :param references: list(str)
            list of str containing references
        """
        results, _ = compute_code_eval(
            references=references,
            predictions=generations,
            timeout=10.0,  # 10s timeout
        )
        return results
```
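
To make the prompt and reference construction above concrete, the snippet below mirrors the logic of `get_prompt` and `get_reference` on a hypothetical record in the MBPP+ schema (the `add` task text and asserts are made up for illustration, not quoted from the dataset):

```python
import os

# Hypothetical MBPP+ record; only the field names mirror the code above.
doc = {
    "prompt": "Write a function to add two numbers.",
    "test_list": ["assert add(1, 2) == 3"],                     # original MBPP base test
    "test": "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n",  # EvalPlus extended tests, ready to execute
}

# Mirrors MBPPPlus.get_prompt: the description plus the first base test, wrapped in a docstring.
prompt = f'"""\n{doc["prompt"]}\n{doc["test_list"][0]}\n"""\n'
print(prompt)

# Mirrors MBPPPlus.get_reference: the extended "test" field by default,
# or the base tests when MBBPPLUS_USE_MBPP_TESTS=1 is set in the environment.
if os.getenv("MBBPPLUS_USE_MBPP_TESTS", "0") == "1":
    reference = "\n".join(doc["test_list"])
else:
    reference = "\n" + doc["test"]
print(reference)
```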

docs/README.md

Lines changed: 31 additions & 0 deletions
````diff
@@ -206,6 +206,37 @@ accelerate launch main.py \
 
 Low temperatures generally work better for small $k$ in pass@k.
 
+### MBPP+
+[MBPP+](https://huggingface.co/datasets/evalplus/mbppplus): MBPP with additional unit tests (35x those of the original MBPP) for each of its 399 problems.
+
+Generation and evaluation follow the same approach as [MBPP](#mbpp); one only needs to change the task name to `mbppplus`, for example:
+
+> [!Note]
+> MBPP+ only includes **399** tasks, which are a subset of the original MBPP dataset (~1000 tasks).
+> The subset is selected from the sanitized MBPP (a subset of ~427 tasks manually examined by the original MBPP authors),
+> and EvalPlus further removes low-quality and ill-formed ones for benchmark quality control to get MBPP+.
+
+```bash
+accelerate launch main.py \
+  --model <MODEL_NAME> \
+  --max_length_generation <MAX_LENGTH> \
+  --tasks mbppplus \
+  --temperature 0.1 \
+  --n_samples 15 \
+  --batch_size 10 \
+  --allow_code_execution
+```
+
+By setting `MBBPPLUS_USE_MBPP_TESTS=1` when running MBPP+, one can run the 399 MBPP+ tasks (a subset of the 500 MBPP evaluation tasks) with the original MBPP base tests:
+
+```bash
+MBBPPLUS_USE_MBPP_TESTS=1 accelerate launch main.py \
+  --tasks mbppplus \
+  --allow_code_execution \
+  --load_generations_path generations_mbppplus.json \
+  --model <MODEL_NAME>
+```
+
 ### DS-1000
 [DS-1000](https://ds1000-code-gen.github.io/): Code generation benchmark with 1000 data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
 
````