| 1 | +# MuSR |
| 2 | + |
| 3 | +[MuSR](https://arxiv.org/abs/2310.16049) is a dataset for evaluating models on multistep soft reasoning tasks posed as free-text narratives. It comprises three domains: murder_mysteries, object_placements, and team_allocation, with 250, 256, and 250 instances respectively. |
| 4 | + |
| 5 | +<!-- Contributors: Automatically Generated --> |
| 6 | +Contributed by [@farrelmahaztra](https://github.com/farrelmahaztra) |
| 7 | +<!-- /Contributors: Automatically Generated --> |
| 8 | + |
| 9 | +<!-- Usage: Automatically Generated --> |
| 10 | +## Usage |
| 11 | + |
| 12 | +First, install the `inspect_ai` and `inspect_evals` Python packages with: |
| 13 | +```bash |
| 14 | +pip install inspect_ai |
| 15 | +pip install git+https://github.com/UKGovernmentBEIS/inspect_evals |
| 16 | +``` |
| 17 | + |
| 18 | +Then, evaluate against one or more models with: |
| 19 | +```bash |
| 20 | +inspect eval inspect_evals/musr --model openai/gpt-4o |
| 21 | +``` |
| 22 | + |
| 23 | +After running evaluations, you can view their logs using the `inspect view` command: |
| 24 | + |
| 25 | +```bash |
| 26 | +inspect view |
| 27 | +``` |
| 28 | + |
| 29 | +If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example: |
| 30 | + |
| 31 | +```bash |
| 32 | +INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620 |
| 33 | +ANTHROPIC_API_KEY=<anthropic-api-key> |
| 34 | +``` |
| 35 | +<!-- /Usage: Automatically Generated --> |
| 36 | + |
| 37 | +<!-- Options: Automatically Generated --> |
| 38 | +## Options |
| 39 | + |
| 40 | +You can control a variety of options from the command line. For example: |
| 41 | +```bash |
| 42 | +inspect eval inspect_evals/musr --limit 10 |
| 43 | +inspect eval inspect_evals/musr --max-connections 10 |
| 44 | +inspect eval inspect_evals/musr --temperature 0.5 |
| 45 | +``` |
| 46 | + |
| 47 | +See `inspect eval --help` for all available options. |
| 48 | +<!-- /Options: Automatically Generated --> |
| 49 | + |
| 50 | +## Dataset |
| 51 | +Here is a truncated example from the dataset: |
| 52 | + |
| 53 | +>The tension between them was palpable. Alice had been awarded a major journalist award that Gabrielle had desired. This only deepened their rivalry, with Gabrielle feeling overlooked for this recognition in the Jazz scene. |
| 54 | +> |
| 55 | +>Winston cast his gaze over the club once more—a hub of pulsating rhythms now eerily silent. |
| 56 | +> |
| 57 | +>A significant part of the evening was Gabrielle's recorded interview with Alice. It played on the local radio, their professional rivalry subtly echoing under their professional demeanor. |
| 58 | +> |
| 59 | +>With a deep breath, Winston knew he had a tall task ahead. The jazz club, where Alice was last seen alive was now shrouded in an eerie silence, the vibrant rhythms of what used to be a lively night echoing in the abandoned stage. It was up to him to piece together the missing notes and bring the symphony of this unsolved case to a satisfying finale. |
| 60 | +> |
| 61 | +>Who is the most likely murderer? |
| 62 | +> |
| 63 | +>Pick one of the following choices: |
| 64 | +>A - Eugene |
| 65 | +>B - Gabrielle |
| 66 | + |
| 67 | +The model is asked to answer the question by choosing one of the listed options. |
| 68 | + |
| 69 | +## Evaluation |
| 70 | +The prompts are based on the [official MuSR repository](https://github.com/Zayne-sprague/MuSR). The in-built `multiple_choice` scorer is used for evaluation. |
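To make the scoring step concrete, here is a minimal sketch of how a multiple-choice completion could be graded against a target letter. It is an illustration only and not Inspect's implementation: the `ANSWER: <letter>` output format, the helper names `extract_choice`, `score_choice`, and `accuracy`, and the sample completions are all assumptions introduced for this example.

```python
import re
from typing import Optional, Sequence

# Assumed output format: the model ends its reply with a line such as "ANSWER: B".
# This format is an assumption for illustration, not taken from the MuSR prompts.
ANSWER_PATTERN = re.compile(r"ANSWER\s*:\s*([A-Z])\b", re.IGNORECASE)


def extract_choice(completion: str) -> Optional[str]:
    """Return the last answer letter found in the completion, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1].upper() if matches else None


def score_choice(completion: str, target: str) -> bool:
    """True when the extracted letter equals the target letter (case-insensitive)."""
    return extract_choice(completion) == target.upper()


def accuracy(completions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of completions whose extracted letter matches the target."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not completions:
        return 0.0
    correct = sum(score_choice(c, t) for c, t in zip(completions, targets))
    return correct / len(completions)


if __name__ == "__main__":
    # Hypothetical completions and targets, not real MuSR results.
    sample_completions = ["The clues point one way.\nANSWER: B", "ANSWER: A"]
    sample_targets = ["B", "B"]
    print(accuracy(sample_completions, sample_targets))
```

Taking the last match rather than the first lets a model reason at length, and even mention other options, before committing to a final answer. A completion with no recognizable answer line is simply scored as incorrect.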