Commit d94463e

Add README
1 parent 77a1531 commit d94463e

3 files changed (+85, -0 lines): README.md, src/inspect_evals/musr/README.md, tools/listing.yaml

README.md

Lines changed: 6 additions & 0 deletions
@@ -295,6 +295,12 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/ifeval
   ```
 
+- ### [MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning](src/inspect_evals/musr)
+  Evaluating models on multistep soft reasoning tasks in the form of free text narratives.
+  <sub><sup>Contributed by: [@farrelmahaztra](https://github.com/farrelmahaztra)</sub></sup>
+  ```
+  inspect eval inspect_evals/musr
+  ```
 
 - ### [PAWS: Paraphrase Adversaries from Word Scrambling](src/inspect_evals/paws)
   Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.

src/inspect_evals/musr/README.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

# MuSR

[MuSR](https://arxiv.org/abs/2310.16049) is a dataset designed for evaluating models on multistep soft reasoning tasks in the form of free text narratives. It is composed of 3 domains: murder_mysteries, object_placements, and team_allocation, which have 250, 256, and 250 instances in the dataset, respectively.

<!-- Contributors: Automatically Generated -->
Contributed by [@farrelmahaztra](https://github.com/farrelmahaztra)
<!-- /Contributors: Automatically Generated -->

<!-- Usage: Automatically Generated -->
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:
```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

Then, evaluate against one or more models with:
```bash
inspect eval inspect_evals/musr --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
<!-- /Usage: Automatically Generated -->

<!-- Options: Automatically Generated -->
## Options

You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/musr --limit 10
inspect eval inspect_evals/musr --max-connections 10
inspect eval inspect_evals/musr --temperature 0.5
```

See `inspect eval --help` for all available options.
<!-- /Options: Automatically Generated -->

## Dataset
Here is a truncated example from the dataset:

>The tension between them was palpable. Alice had been awarded a major journalist award that Gabrielle had desired. This only deepened their rivalry, with Gabrielle feeling overlooked for this recognition in the Jazz scene.
>
>Winston cast his gaze over the club once more—a hub of pulsating rhythms now eerily silent.
>
>A significant part of the evening was Gabrielle's recorded interview with Alice. It played on the local radio, their professional rivalry subtly echoing under their professional demeanor.
>
>With a deep breath, Winston knew he had a tall task ahead. The jazz club, where Alice was last seen alive was now shrouded in an eerie silence, the vibrant rhythms of what used to be a lively night echoing in the abandoned stage. It was up to him to piece together the missing notes and bring the symphony of this unsolved case to a satisfying finale.
>
>Who is the most likely murderer?
>
>Pick one of the following choices:
>A - Eugene
>B - Gabrielle

The model is tasked with answering the question by choosing the appropriate option.

## Evaluation
The prompts are based on the [official MuSR repository](https://github.com/Zayne-sprague/MuSR). The built-in `multiple_choice` scorer is used for evaluation.
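
For orientation, here is a minimal sketch of how a MuSR-style multiple-choice task can be wired together with `inspect_ai`. It is illustrative only and is not the implementation added under `src/inspect_evals/musr`; it assumes inspect_ai's `Task`, `Sample`, and `MemoryDataset` types together with the `multiple_choice()` solver and `choice()` scorer, and it substitutes a single hand-written sample for the real MuSR records:

```python
# Illustrative sketch only -- not the implementation shipped in
# src/inspect_evals/musr. Assumes inspect_ai's Task/Sample/MemoryDataset,
# the multiple_choice() solver, and the choice() scorer.
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


@task
def musr_sketch() -> Task:
    # One hand-written record standing in for the MuSR dataset, which pairs
    # a free-text narrative with a question and a list of answer choices.
    dataset = MemoryDataset(
        [
            Sample(
                input="<narrative omitted> Who is the most likely murderer?",
                choices=["Eugene", "Gabrielle"],
                target="B",  # letter of the correct choice
            )
        ]
    )
    return Task(
        dataset=dataset,
        solver=multiple_choice(),  # formats the lettered choices into the prompt
        scorer=choice(),  # compares the model's selected letter against target
    )
```

Saved locally as, say, `musr_sketch.py`, such a file could be run with `inspect eval musr_sketch.py --model openai/gpt-4o`, mirroring the registry-based command shown in the Usage section above.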

tools/listing.yaml

Lines changed: 9 additions & 0 deletions
@@ -262,6 +262,15 @@
   contributors: ["adil-a"]
   tasks: ["ifeval"]
 
+- title: "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning"
+  description: |
+    Evaluating models on multistep soft reasoning tasks in the form of free text narratives.
+  path: src/inspect_evals/musr
+  arxiv: https://arxiv.org/abs/2310.16049
+  group: Reasoning
+  contributors: ["farrelmahaztra"]
+  tasks: ["musr"]
+
 - title: "PAWS: Paraphrase Adversaries from Word Scrambling"
   description: |
     Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.
