
Commit e4b80f9

Merge pull request rllm-org#156 from MattFisher/cleanup/class_eval
Follow-up: ClassEval
2 parents a46087e + 4b713a8 commit e4b80f9

12 files changed: +288 −177 lines changed

README.md

Lines changed: 8 additions & 0 deletions
@@ -83,6 +83,14 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/ds1000
   ```
 
+- ### [ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation](src/inspect_evals/class_eval)
+  Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The study shows that LLMs perform worse on class-level tasks compared to method-level tasks. GPT-4 and GPT-3.5 outperform other models, with holistic generation being the best strategy for them, while incremental generation works better for other models. The study also shows that the performance of LLMs is highly correlated with the number of tokens in the prompt.
+  <sub><sup>Contributed by: [@zhenningdavidliu](https://github.com/zhenningdavidliu)</sub></sup>
+
+  ```bash
+  inspect eval inspect_evals/class_eval
+  ```
+
 ## Assistants
 
 - ### [GAIA: A Benchmark for General AI Assistants](src/inspect_evals/gaia)

src/inspect_evals/_registry.py

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@
 from .assistant_bench import assistant_bench
 from .bbh import bbh
 from .boolq import boolq
+from .class_eval import class_eval
 from .commonsense_qa import commonsense_qa
 from .cybench import cybench
 from .cybermetric import (
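
With the task imported into the registry, `inspect eval inspect_evals/class_eval` can be resolved by name on the command line. The task can also be run from Python; the following is a minimal sketch (the model name and sample limit are illustrative, not part of this change):

```python
# Minimal sketch, not part of this commit: run the registered task from Python.
# The model and limit are illustrative; few_shot=5 turns the eval into a
# 5-epoch run scored with the pass_at_5 reducer (see class_eval.py below).
from inspect_ai import eval
from inspect_evals.class_eval import class_eval

eval(class_eval(few_shot=5), model="openai/gpt-4o", limit=10)
```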
src/inspect_evals/class_eval/README.md

Lines changed: 66 additions & 0 deletions

@@ -0,0 +1,66 @@
# ClassEval

[ClassEval](https://github.com/FudanSELab/ClassEval) is a benchmark for evaluating large language models (LLMs) on class-level code generation tasks. It contains 100 class-level Python code generation tasks, constructed over approximately 500 person-hours.

<!-- Contributors: Automatically Generated -->
Contributed by [@zhenningdavidliu](https://github.com/zhenningdavidliu)
<!-- /Contributors: Automatically Generated -->

<!-- Usage: Automatically Generated -->
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/class_eval --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
<!-- /Usage: Automatically Generated -->

> [!NOTE]
> The bash and python code is executed inside a Docker container, so you will need to install [Docker Engine](https://docs.docker.com/engine/install/) in order to run the evaluation.
>
> Docker containers are also used in some challenges for auxiliary services. When you run this, the containers are pulled from a public registry. The source code is still included, however, both for your reference and because some files are copied from them into the agent's container at runtime.

<!-- Options: Automatically Generated -->
## Options

You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/class_eval --limit 10
inspect eval inspect_evals/class_eval --max-connections 10
inspect eval inspect_evals/class_eval --temperature 0.5
```

See `inspect eval --help` for all available options.
<!-- /Options: Automatically Generated -->

## Dataset

See https://huggingface.co/datasets/FudanSELab/ClassEval for more information.

## Scoring

For each sample, the test cases from the dataset are appended to the LLM-generated code, and the combined program is executed to check whether it passes all the test cases. If all test cases pass, the sample is considered correct and receives a score of 1; otherwise it receives a score of 0.

The final score for the model is the average score across all samples.
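
The snippet below is a minimal sketch of that scoring rule, not the code shipped in this commit (the actual scorer runs inside a Docker sandbox and is shown in class_eval.py below). It assumes the appended test code exits with a non-zero status when any test fails, e.g. via a `unittest.main()` call:

```python
# Minimal sketch of the pass/fail scoring rule described above (illustrative only).
# Assumption: the dataset's test code exits non-zero when any test fails.
import subprocess


def score_sample(generated_code: str, test_code: str, timeout: int = 30) -> int:
    """Return 1 if the generated class passes every test case, else 0."""
    program = generated_code + "\n" + test_code
    try:
        result = subprocess.run(
            ["python", "-c", program], capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 0
    return 1 if result.returncode == 0 else 0


def final_score(sample_scores: list[int]) -> float:
    """The reported score is the mean over all samples."""
    return sum(sample_scores) / len(sample_scores)
```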
src/inspect_evals/class_eval/__init__.py

Lines changed: 3 additions & 1 deletion

@@ -1 +1,3 @@
-from .class_eval import class_eval
+from .class_eval import class_eval
+
+__all__ = ["class_eval"]

src/inspect_evals/class_eval/class_eval.py

Lines changed: 26 additions & 30 deletions
@@ -11,23 +11,22 @@
 import re
 from typing import Any
 
-from inspect_ai import Task, task, Epochs
+from inspect_ai import Epochs, Task, task
 from inspect_ai.dataset import Sample, hf_dataset
 from inspect_ai.scorer import (
     CORRECT,
     INCORRECT,
     Score,
     Scorer,
     Target,
-    scorer,
     mean,
+    scorer,
     std,
 )
 from inspect_ai.solver import (
-    generate,
-    Solver,
-    system_message,
     TaskState,
+    generate,
+    system_message,
 )
 from inspect_ai.util import ExecResult, sandbox
 
@@ -36,22 +35,19 @@
 # Timeout for scoring.
 VERIFY_TIMEOUT = 30
 
+
 @task
 def class_eval(few_shot: int = 1, few_shot_seed: int = 42) -> Task:
     """Inspect Task implementation of ClassEval.
 
     Args:
-        k_shot (int): The number of few shots to include.
-        k_shot_seed (int): The seed for generating few shots.
-        solver (Solver): The solver to use for this evaluation. Defaults to the default solver.
-        instruction_prompt (String): The prompt to prepend to the code problem.
-        scorer (Scorer): The scorer to use for this evaluation. Defaults to the default scorer.
+        few_shot (int): The number of few shots to include.
+        few_shot_seed (int): The seed for generating few shots.
     """
-
     dataset = hf_dataset(
         path="FudanSELab/ClassEval",
         split="test",
-        sample_fields=record_to_sample
+        sample_fields=record_to_sample,
     )
 
     INSTRUCTION = """
@@ -64,23 +60,23 @@ def class_eval(few_shot: int = 1, few_shot_seed: int = 42) -> Task:
         dataset=dataset,
         solver=[system_message(INSTRUCTION), generate()],
         scorer=class_eval_scorer(),
-        epochs = Epochs(few_shot, [f"pass_at_{few_shot}"]),
+        epochs=Epochs(few_shot, [f"pass_at_{few_shot}"]),
         sandbox="docker",
     )
 
 
 @scorer(metrics=[mean(), std()])
 def class_eval_scorer() -> Scorer:
-    '''
-    This is the scorer for class eval. It will first identify the python code in the output of
-    the model. Then we append the test cases to the code and execute it. If the code passes all the test cases,
-    we return a CORRECT score. Otherwise, we return an INCORRECT score.
-    '''
+    """
+    Scorer for class eval
 
-    async def score(state:TaskState, target: Target) -> Score:
-
-        result = {}
+    It will first identify the python code in the output of the model.
+    Then we append the test cases to the code and execute it.
+    If the code passes all the test cases, we return a CORRECT score.
+    Otherwise, we return an INCORRECT score.
+    """
 
+    async def score(state: TaskState, target: Target) -> Score:
         generated_code = find_code(state.output.completion)
         code = generated_code + "\n" + state.metadata["test"]
 
@@ -90,7 +86,7 @@ async def score(state:TaskState, target: Target) -> Score:
         explanation += "\n```\n"
 
         try:
-            result = await sandbox().exec(
+            result: ExecResult[str] = await sandbox().exec(
                 cmd=["python", "-c", code],
                 timeout=VERIFY_TIMEOUT,
             )
@@ -100,7 +96,9 @@ async def score(state:TaskState, target: Target) -> Score:
             else:
                 explanation += "Code did not pass all test cases.\n"
                 if result.stderr:
-                    explanation += f"See details below.\n ```python\n {result.stderr} \n ```\n"
+                    explanation += (
+                        f"See details below.\n ```python\n {result.stderr} \n ```\n"
+                    )
         except TimeoutError:
             result = ExecResult(False, 1, "", "Verification timed out.")
             explanation += "Verification timed out."
@@ -126,13 +124,11 @@ def find_code(completion: str) -> str:
 def record_to_sample(
     record: dict[str, Any],
 ) -> Sample:
-    '''
-    Maps class_eval record into inspect sample
-    '''
+    """Convert a record from the dataset to a Sample"""
     return Sample(
-        input = construct_prompt(record),
-        target = record["solution_code"],
-        id = record["task_id"],
+        input=construct_prompt(record),
+        target=record["solution_code"],
+        id=record["task_id"],
         metadata={
             "task_id": record["task_id"],
             "skeleton": record["skeleton"],
@@ -145,5 +141,5 @@ def record_to_sample(
             "test_classes": record["test_classes"],
             "class_constructor": record["class_constructor"],
             "fields": record["fields"],
-        }
+        },
     )
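
The `find_code` helper referenced above is unchanged by this commit, so its body does not appear in the diff. As a rough, hypothetical sketch of what such a helper typically does (extract the first fenced Python block from the completion, falling back to the raw text):

```python
import re


# Hypothetical sketch only; the real find_code in class_eval.py is not shown in this diff.
def extract_python_block(completion: str) -> str:
    """Return the first fenced Python code block, or the raw completion if none is found."""
    match = re.search(r"```(?:python)?\n(.*?)```", completion, re.DOTALL)
    return match.group(1) if match else completion
```

The scorer then treats a successful sandbox run as CORRECT and anything else as INCORRECT, with the `mean()` and `std()` metrics aggregating those per-sample results.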

src/inspect_evals/class_eval/test.py

Lines changed: 0 additions & 37 deletions
This file was deleted.

src/inspect_evals/class_eval/test_data.py

Lines changed: 0 additions & 86 deletions
This file was deleted.

0 commit comments
