
Commit 9cfa52b

Merge pull request #187 from ganler/heplus
Add humaneval+ evaluation task
2 parents: 3910745 + b62e7a6

4 files changed: 76 additions and 2 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Below are the features and tasks of this framework:
 - We provide Multi-GPU text generation with `accelerate` and Dockerfiles for evaluating on Docker containers for security and reproducibility.

 - Tasks:
-  - 5 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp) and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode.
+  - 6 code generation **Python** tasks (with unit tests): [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp) and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode.
   - [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) extends HumanEval to **3** scenarios across **6** languages via human translations and was released with [OctoPack](https://arxiv.org/abs/2308.07124).
   - [MultiPL-E](https://github.com/nuprl/MultiPL-E) evaluation suite (HumanEval translated into **18** programming languages).
   - [Recode](https://github.com/amazon-science/recode/tree/main) applied to the HumanEval benchmark. It evaluates the robustness of code-generation models.

bigcode_eval/tasks/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -2,7 +2,7 @@
 from pprint import pprint

 from . import (apps, codexglue_code_to_text, codexglue_text_to_text, conala,
-               concode, ds1000, gsm, humaneval, humanevalpack,
+               concode, ds1000, gsm, humaneval, humanevalplus, humanevalpack,
                instruct_humaneval, instruct_wizard_humaneval, mbpp, multiple,
                parity, python_bugs, quixbugs, recode, santacoder_fim)

@@ -16,6 +16,7 @@
     "concode": concode.Concode,
     **ds1000.create_all_tasks(),
     **humaneval.create_all_tasks(),
+    **humanevalplus.create_all_tasks(),
     **humanevalpack.create_all_tasks(),
     "mbpp": mbpp.MBPP,
     "parity": parity.Parity,
bigcode_eval/tasks/humanevalplus.py

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@
+"""Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
+https://openreview.net/forum?id=1qvx610Cu7
+
+The HumanEval+ dataset is created by the EvalPlus framework which extends the original HumanEval dataset
+by adding more automatically generated test cases to each problem.
+
+Homepage: https://github.com/evalplus/evalplus
+"""
+
+from warnings import warn
+
+from bigcode_eval.tasks.humaneval import GeneralHumanEval
+
+_CITATION = """
+@inproceedings{evalplus,
+  title = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
+  author = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
+  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
+  year = {2023},
+  url = {https://openreview.net/forum?id=1qvx610Cu7},
+}
+"""
+
+
+class GeneralHumanEvalPlus(GeneralHumanEval):
+    """A task represents an entire benchmark including its dataset, problems,
+    answers, generation settings and evaluation methods.
+    """
+
+    DATASET_PATH = "evalplus/humanevalplus"
+
+    def __init__(self, strip_prompt, k=[1, 10, 100], num_workers=16, timeout=10.0):
+        if timeout < 10.0:
+            warn(
+                "It is suggested to have a longer timeout as HumanEval+ has lots of tests. "
+                f"The current timeout is {timeout}s while the suggested timeout is 10s."
+            )
+        super().__init__(strip_prompt, k, num_workers, timeout)
+
+
+def create_task(strip_prompt):
+    class HumanEvalPlus(GeneralHumanEvalPlus):
+        def __init__(self, **kwargs):
+            super().__init__(strip_prompt, **kwargs)
+
+    return HumanEvalPlus
+
+
+def create_all_tasks():
+    """Creates a dictionary of tasks from a list of levels
+    :return: {task_name: task}
+        e.g. {multiple-py: Task, multiple-java: Task}
+    """
+    return {
+        "humanevalplus": create_task(True),
+        "humanevalplus-unstripped": create_task(False),
+    }
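To make the behaviour of the new module concrete, here is a short usage sketch that is not part of the commit. It assumes the harness is installed and that the `evalplus/humanevalplus` dataset can be fetched, since task classes in this harness typically load their dataset on instantiation.

```python
# Usage sketch (not from the commit): exercise the factory and the timeout check.
from bigcode_eval.tasks import humanevalplus

tasks = humanevalplus.create_all_tasks()
print(sorted(tasks))  # ['humanevalplus', 'humanevalplus-unstripped']

# The default timeout is 10.0s; anything shorter triggers the warning defined in
# GeneralHumanEvalPlus.__init__, since HumanEval+ runs many more tests per
# problem than the original HumanEval.
task = tasks["humanevalplus"](timeout=5.0)   # emits the "longer timeout" warning
print(task.DATASET_PATH)                     # evalplus/humanevalplus
```

The `-unstripped` variant differs only in whether the prompt is stripped before generation, mirroring the existing HumanEval task pair the class inherits from.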

docs/README.md

Lines changed: 16 additions & 0 deletions
@@ -49,6 +49,22 @@ accelerate launch main.py \

 If you want to evaluate only on the first $n$ samples instead of the whole test dataset, set the `limit` argument to $n$.

+### HumanEval+
+[HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus): HumanEval with additional unit tests (80x the original HumanEval) for each of the 164 problems.
+
+Generation and evaluation follow the same approach as [HumanEval](#humaneval); one only needs to change the task name to `humanevalplus` to run the evaluation on HumanEval+, for example:
+
+```bash
+accelerate launch main.py \
+  --model <MODEL_NAME> \
+  --max_length_generation <MAX_LENGTH> \
+  --tasks humanevalplus \
+  --temperature 0.2 \
+  --n_samples 200 \
+  --batch_size 10 \
+  --allow_code_execution
+```
+

 ### HumanEvalPack
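One note on the HumanEval+ command in the hunk above: `--n_samples 200` pairs with the task's default `k=[1, 10, 100]` because pass@k is estimated from n sampled completions per problem with the standard unbiased estimator (Chen et al., 2021), which needs n ≥ k. A small sketch of that estimator follows, added here for illustration and not part of the commit.

```python
# Unbiased pass@k estimator used by HumanEval-style evaluation:
#   n = completions generated per problem, c = completions passing all tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical numbers: 37 of 200 samples pass a problem's extended test suite.
print(pass_at_k(n=200, c=37, k=1))    # 0.185
print(pass_at_k(n=200, c=37, k=100))  # ~1.0
```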
