Commit d94463e

Add README
1 parent 77a1531 commit d94463e

3 files changed (+85, -0 lines): README.md, src/inspect_evals/musr/README.md, tools/listing.yaml

README.md

Lines changed: 6 additions & 0 deletions
@@ -295,6 +295,12 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/ifeval
   ```
 
+- ### [MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning](src/inspect_evals/musr)
+  Evaluating models on multistep soft reasoning tasks in the form of free text narratives.
+  <sub><sup>Contributed by: [@farrelmahaztra](https://github.com/farrelmahaztra)</sub></sup>
+  ```
+  inspect eval inspect_evals/musr
+  ```
 
 - ### [PAWS: Paraphrase Adversaries from Word Scrambling](src/inspect_evals/paws)
   Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.

src/inspect_evals/musr/README.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@

# MuSR

[MuSR](https://arxiv.org/abs/2310.16049) is a dataset designed for evaluating models on multistep soft reasoning tasks in the form of free text narratives. It is composed of 3 domains: murder_mysteries, object_placements, and team_allocation, which have 250, 256, and 250 instances in the dataset, respectively.

<!-- Contributors: Automatically Generated -->
Contributed by [@farrelmahaztra](https://github.com/farrelmahaztra)
<!-- /Contributors: Automatically Generated -->

<!-- Usage: Automatically Generated -->
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:
```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

Then, evaluate against one or more models with:
```bash
inspect eval inspect_evals/musr --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
<!-- /Usage: Automatically Generated -->

<!-- Options: Automatically Generated -->
## Options

You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/musr --limit 10
inspect eval inspect_evals/musr --max-connections 10
inspect eval inspect_evals/musr --temperature 0.5
```

See `inspect eval --help` for all available options.
<!-- /Options: Automatically Generated -->

## Dataset
Here is a truncated example from the dataset:

>The tension between them was palpable. Alice had been awarded a major journalist award that Gabrielle had desired. This only deepened their rivalry, with Gabrielle feeling overlooked for this recognition in the Jazz scene.
>
>Winston cast his gaze over the club once more—a hub of pulsating rhythms now eerily silent.
>
>A significant part of the evening was Gabrielle's recorded interview with Alice. It played on the local radio, their professional rivalry subtly echoing under their professional demeanor.
>
>With a deep breath, Winston knew he had a tall task ahead. The jazz club, where Alice was last seen alive was now shrouded in an eerie silence, the vibrant rhythms of what used to be a lively night echoing in the abandoned stage. It was up to him to piece together the missing notes and bring the symphony of this unsolved case to a satisfying finale.
>
>Who is the most likely murderer?
>
>Pick one of the following choices:
>A - Eugene
>B - Gabrielle

The model is tasked with answering the question by choosing the appropriate option.

## Evaluation
The prompts are based on the [official MuSR repository](https://github.com/Zayne-sprague/MuSR). The built-in `multiple_choice` scorer is used for evaluation.
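
For orientation, here is a minimal sketch of how a MuSR-style multiple-choice task can be wired together with `inspect_ai`. It is illustrative only and is not the implementation added under `src/inspect_evals/musr`; it assumes inspect_ai's `Task`, `Sample`, and `MemoryDataset` types together with the `multiple_choice()` solver and `choice()` scorer, and it substitutes a single hand-written sample for the real MuSR records:

```python
# Illustrative sketch only -- not the implementation shipped in
# src/inspect_evals/musr. Assumes inspect_ai's Task/Sample/MemoryDataset,
# the multiple_choice() solver, and the choice() scorer.
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


@task
def musr_sketch() -> Task:
    # One hand-written record standing in for the MuSR dataset, which pairs
    # a free-text narrative with a question and a list of answer choices.
    dataset = MemoryDataset(
        [
            Sample(
                input="<narrative omitted> Who is the most likely murderer?",
                choices=["Eugene", "Gabrielle"],
                target="B",  # letter of the correct choice
            )
        ]
    )
    return Task(
        dataset=dataset,
        solver=multiple_choice(),  # formats the lettered choices into the prompt
        scorer=choice(),  # compares the model's selected letter against target
    )
```

Saved locally as, say, `musr_sketch.py`, such a file could be run with `inspect eval musr_sketch.py --model openai/gpt-4o`, mirroring the registry-based command shown in the Usage section above.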

tools/listing.yaml

Lines changed: 9 additions & 0 deletions
@@ -262,6 +262,15 @@
   contributors: ["adil-a"]
   tasks: ["ifeval"]
 
+- title: "MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning"
+  description: |
+    Evaluating models on multistep soft reasoning tasks in the form of free text narratives.
+  path: src/inspect_evals/musr
+  arxiv: https://arxiv.org/abs/2310.16049
+  group: Reasoning
+  contributors: ["farrelmahaztra"]
+  tasks: ["musr"]
+
 - title: "PAWS: Paraphrase Adversaries from Word Scrambling"
   description: |
     Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.
