
Commit 599994f

Merge pull request rllm-org#146 from farrelmahaztra/musr
MuSR Benchmark Implementation
2 parents e4b80f9 + 33cb405 commit 599994f

7 files changed, +464 -0 lines changed


README.md

Lines changed: 7 additions & 0 deletions
@@ -349,6 +349,13 @@ The questions were generated by GPT-4 based on the "Computer Systems Security: P
 inspect eval inspect_evals/ifeval
 ```
 
+- ### [MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning](src/inspect_evals/musr)
+  Evaluating models on multistep soft reasoning tasks in the form of free text narratives.
+  <sub><sup>Contributed by: [@farrelmahaztra](https://github.com/farrelmahaztra)</sub></sup>
+  ```
+  inspect eval inspect_evals/musr
+  ```
+
 - ### [PAWS: Paraphrase Adversaries from Word Scrambling](src/inspect_evals/paws)
   Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.
   <sub><sup>Contributed by: [@meltemkenis](https://github.com/meltemkenis)</sub></sup>

src/inspect_evals/_registry.py

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@
 from .mmlu import mmlu_0_shot, mmlu_5_shot
 from .mmlu_pro import mmlu_pro
 from .mmmu import mmmu_multiple_choice, mmmu_open
+from .musr import musr
 from .paws import paws
 from .piqa import piqa
 from .pubmedqa import pubmedqa

src/inspect_evals/musr/README.md

Lines changed: 70 additions & 0 deletions
# MuSR

[MuSR](https://arxiv.org/abs/2310.16049) is a dataset designed for evaluating models on multistep soft reasoning tasks in the form of free text narratives. It is composed of 3 domains: murder_mysteries, object_placements, and team_allocation, which have 250, 256, and 250 instances respectively in the dataset.

<!-- Contributors: Automatically Generated -->
Contributed by [@farrelmahaztra](https://github.com/farrelmahaztra)
<!-- /Contributors: Automatically Generated -->

<!-- Usage: Automatically Generated -->
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:
```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

Then, evaluate against one or more models with:
```bash
inspect eval inspect_evals/musr --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
<!-- /Usage: Automatically Generated -->

<!-- Options: Automatically Generated -->
## Options

You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/musr --limit 10
inspect eval inspect_evals/musr --max-connections 10
inspect eval inspect_evals/musr --temperature 0.5
```

See `inspect eval --help` for all available options.
<!-- /Options: Automatically Generated -->

## Dataset
Here is a truncated example from the dataset:

>The tension between them was palpable. Alice had been awarded a major journalist award that Gabrielle had desired. This only deepened their rivalry, with Gabrielle feeling overlooked for this recognition in the Jazz scene.
>
>Winston cast his gaze over the club once more—a hub of pulsating rhythms now eerily silent.
>
>A significant part of the evening was Gabrielle's recorded interview with Alice. It played on the local radio, their professional rivalry subtly echoing under their professional demeanor.
>
>With a deep breath, Winston knew he had a tall task ahead. The jazz club, where Alice was last seen alive was now shrouded in an eerie silence, the vibrant rhythms of what used to be a lively night echoing in the abandoned stage. It was up to him to piece together the missing notes and bring the symphony of this unsolved case to a satisfying finale.
>
>Who is the most likely murderer?
>
>Pick one of the following choices:
>A - Eugene
>B - Gabrielle

The model is tasked with answering the question by choosing the appropriate option.

## Evaluation
The prompts are based on the [official MuSR repository](https://github.com/Zayne-sprague/MuSR). The built-in `multiple_choice` solver presents the options, and the built-in `choice` scorer is used for evaluation.
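
The task also exposes `domain`, `prompt_technique`, and `example_count` as parameters (see `musr.py` below). As a minimal sketch of running a non-default configuration through inspect_ai's Python `eval` API (the model name here is illustrative):

```python
from inspect_ai import eval

from inspect_evals.musr import musr

# Evaluate the team_allocation split with the "cot+" prompt, which adds a
# domain-specific hint to the chain-of-thought template.
eval(
    musr(domain="team_allocation", prompt_technique="cot+"),
    model="openai/gpt-4o",  # illustrative; use any Inspect-supported model
    limit=10,               # optional: evaluate a 10-sample subset
)
```

The same configuration can be selected from the CLI with `-T` task options, e.g. `inspect eval inspect_evals/musr -T domain=team_allocation -T prompt_technique=cot+`.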

src/inspect_evals/musr/__init__.py

Lines changed: 3 additions & 0 deletions
from .musr import musr

__all__ = ["musr"]

src/inspect_evals/musr/musr.py

Lines changed: 124 additions & 0 deletions
"""MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett.
https://arxiv.org/abs/2310.16049

# Example: Eval MuSR with team_allocation and CoT+
inspect eval musr.py -T domain=team_allocation -T prompt_technique=cot+
"""

import ast
from typing import Any, Dict, Literal

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice, system_message

from inspect_evals.musr.prompts import (
    COT_PLUS_PROMPT,
    COT_PROMPT,
    MURDER_MYSTERIES_EXAMPLE,
    MURDER_MYSTERIES_HINT,
    OBJECT_PLACEMENTS_EXAMPLE,
    OBJECT_PLACEMENTS_HINT,
    REGULAR_PROMPT,
    SYSTEM_PROMPT,
    TEAM_ALLOCATION_EXAMPLE,
    TEAM_ALLOCATION_HINT,
)

DomainType = Literal["murder_mysteries", "object_placements", "team_allocation"]
PromptTechniqueType = Literal["regular", "cot", "cot+"]

DEFAULT_DOMAIN: DomainType = "murder_mysteries"
DEFAULT_PROMPT_TECHNIQUE: PromptTechniqueType = "regular"
DEFAULT_EXAMPLE_COUNT = 0


@task
def musr(
    domain: DomainType = DEFAULT_DOMAIN,
    prompt_technique: PromptTechniqueType = DEFAULT_PROMPT_TECHNIQUE,
    example_count: int = DEFAULT_EXAMPLE_COUNT,
) -> Task:
    """Inspect task implementing the MuSR benchmark.

    Args:
        domain (Literal["murder_mysteries", "object_placements", "team_allocation"]): Which domain in the dataset to evaluate.
            Defaults to "murder_mysteries".
        prompt_technique (Literal["regular", "cot", "cot+"]): The prompt technique to use. "regular" includes only the narrative
            and the question. "cot" uses chain-of-thought prompting. "cot+" includes a hint. Defaults to "regular".
        example_count (int): Number of solved examples to include at the beginning of each prompt. Defaults to 0. Currently only supports 1 example.
    """
    prompt = get_domain_prompt(domain, prompt_technique, example_count)

    dataset = hf_dataset(
        path="TAUR-Lab/MuSR",
        split=domain,
        sample_fields=record_to_sample,
        shuffle=True,
        auto_id=True,
    )

    return Task(
        dataset=dataset,
        solver=[
            system_message(SYSTEM_PROMPT),
            multiple_choice(template=prompt),
        ],
        scorer=choice(),
    )


def get_domain_prompt(
    domain: DomainType = DEFAULT_DOMAIN,
    prompt_technique: PromptTechniqueType = DEFAULT_PROMPT_TECHNIQUE,
    example_count: int = DEFAULT_EXAMPLE_COUNT,
) -> str:
    domain_info = {
        "murder_mysteries": {
            "hint": MURDER_MYSTERIES_HINT,
            "example": MURDER_MYSTERIES_EXAMPLE,
        },
        "object_placements": {
            "hint": OBJECT_PLACEMENTS_HINT,
            "example": OBJECT_PLACEMENTS_EXAMPLE,
        },
        "team_allocation": {
            "hint": TEAM_ALLOCATION_HINT,
            "example": TEAM_ALLOCATION_EXAMPLE,
        },
    }

    if domain not in domain_info:
        raise ValueError(
            "Unknown domain. Valid domains are murder_mysteries (default), object_placements, and team_allocation"
        )

    if prompt_technique == "regular":
        prompt = REGULAR_PROMPT
    elif prompt_technique == "cot":
        prompt = COT_PROMPT
    elif prompt_technique == "cot+":
        prompt = COT_PLUS_PROMPT.format(hint=domain_info[domain]["hint"])
    else:
        raise ValueError(
            "Unknown prompt technique. Valid prompt techniques are regular (default), cot, and cot+."
        )

    if example_count > 1:
        raise ValueError(">1 examples currently not supported")
    if example_count == 1:
        return f'Here is an example of solving the task:\n\n{domain_info[domain]["example"]}\n\nThis is the end of the example. The real task is below.\n\n---\n\n{prompt}'
    else:
        return prompt


def record_to_sample(record: Dict[str, Any]) -> Sample:
    return Sample(
        input=f'{record["narrative"]}\n\n{record["question"]}',
        choices=ast.literal_eval(record["choices"]),
        target=chr(ord("A") + int(record["answer_index"])),
    )
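
As a minimal sketch of what `record_to_sample` produces, consider a hypothetical dataset row (the field values below are invented for illustration; real rows come from the `TAUR-Lab/MuSR` Hugging Face dataset):

```python
from inspect_evals.musr.musr import record_to_sample

# Hypothetical MuSR row: "choices" is stored as the string repr of a Python
# list, which is why record_to_sample parses it with ast.literal_eval.
record = {
    "narrative": "Winston surveyed the silent jazz club ...",
    "question": "Who is the most likely murderer?",
    "choices": "['Eugene', 'Gabrielle']",
    "answer_index": 1,
}

sample = record_to_sample(record)
# sample.input   -> the narrative and question joined by a blank line
# sample.choices -> ["Eugene", "Gabrielle"]
# sample.target  -> "B"  (chr(ord("A") + 1))
```

Each domain split is loaded with `hf_dataset(path="TAUR-Lab/MuSR", split=domain, ...)`, so the `domain` task parameter directly selects which of the three splits is evaluated.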
