Cleanup and slight refactor

farrelmahaztra · farrelmahaztra · commit 2d7dcc44ba26 · 2024-12-22T05:26:16.000+07:00
diff --git a/src/inspect_evals/musr/musr.py b/src/inspect_evals/musr/musr.py
@@ -4,7 +4,7 @@
 https://arxiv.org/abs/2310.16049
 
 # Example: Eval MuSR with team_allocation and CoT+
-inspect eval musr.py -T domain=team_allocation prompt_technique=cot+
+inspect eval musr.py -T domain=team_allocation -T prompt_technique=cot+
 
 """
 
@@ -58,8 +58,8 @@
 Who is the most likely murderer?
 
 Pick one of the following choices:
-1 - Eugene
-2 - Gabrielle
+A - Eugene
+B - Gabrielle
 
 You must pick one option. Before selecting a choice, explain your reasoning step by step. The murderer needs to have a means (access to weapon), motive (reason to kill the victim), and opportunity (access to crime scene) in order to have killed the victim. Innocent suspects may have two of these proven, but not all three. An innocent suspect may be suspicious for some other reason, but they will not have all of motive, means, and opportunity established.
 
@@ -90,7 +90,7 @@
 
 Therefore, Gabrielle is the most likely murderer.
 
-ANSWER: 2
+ANSWER: B
 
 
 """.strip()
@@ -122,10 +122,10 @@
 Which location is the most likely place Clara would look to find the glass jar given the story?
 
 Pick one of the following choices:
-1 - dining table
-2 - kitchen table
-3 - pantry
-4 - under counter
+A - dining table
+B - kitchen table
+C - pantry
+D - under counter
 
 You must pick one option. Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
 
@@ -139,7 +139,7 @@
 
 Now, given the story and evidence, we can assume that Clara believes the glass jar to be in the pantry.
 
-ANSWER: 3
+ANSWER: C
 
 """
 
@@ -159,9 +159,9 @@
 Given the story, how would you uniquely allocate each person to make sure both tasks are accomplished efficiently?
 
 Pick one of the following choices:
-1 - Teaching: Travis, Maintenance: Angela and Greg
-2 - Teaching: Greg, Maintenance: Angela and Travis
-3 - Teaching: Angela, Maintenance: Greg and Travis
+A - Teaching: Travis, Maintenance: Angela and Greg
+B - Teaching: Greg, Maintenance: Angela and Travis
+C - Teaching: Angela, Maintenance: Greg and Travis
 
 You must pick one option. The story should allow you to determine how good each person is at a skill. Roughly, each person is either great, acceptable, or bad at a task. We want to find an optimal assignment of people to tasks that uses their skills as well as possible. In addition, one task will have to have two people assigned to it. The effectiveness of their teamwork (great team, acceptable team, or bad team) also impacts the overall quality of the assignment.
 
@@ -201,13 +201,13 @@
 
 Now, let's find the best assignment.
 
-Option 1: Travis as a teacher (1) + Angela working in maintenance (1) + Greg working in maintenance (2) + Angela and Greg work badly together (1) = 5
-Option 2: Greg as a teacher (1) + Angela working in maintenance (1) + Travis working in maintenance (1) + Angela and Travis work badly toghether (1) = 4
-Option 3: Angela as a teacher (1) + Greg working in maintenance (2) + Travis working in maintenance (1) + Greg and Travis work well together (3) = 7
+Option A: Travis as a teacher (1) + Angela working in maintenance (1) + Greg working in maintenance (2) + Angela and Greg work badly together (1) = 5
+Option B: Greg as a teacher (1) + Angela working in maintenance (1) + Travis working in maintenance (1) + Angela and Travis work badly together (1) = 4
+Option C: Angela as a teacher (1) + Greg working in maintenance (2) + Travis working in maintenance (1) + Greg and Travis work well together (3) = 7
 
-So, from this, we can see Option 3 has the maximum score.
+So, from this, we can see Option C has the maximum score.
 
-ANSWER: 3
+ANSWER: C
 
 """
 
@@ -272,7 +272,7 @@
 def get_domain_prompt(
     domain: Literal["murder_mysteries", "object_placements", "team_allocation"],
     prompt_technique: Literal["regular", "cot", "cot+"],
-    use_example: bool,
+    example_count: int,
 ) -> str:
     domain_info = {
         "murder_mysteries": {
@@ -289,15 +289,26 @@ def get_domain_prompt(
         },
     }
 
+    if domain not in domain_info:
+        raise ValueError(
+            "Unknown domain. Valid domains are murder_mysteries (default), object_placements, and team_allocation"
+        )
+
     if prompt_technique == "regular":
         prompt = REGULAR_PROMPT
     elif prompt_technique == "cot":
         prompt = COT_PROMPT
-    else:
+    elif prompt_technique == "cot+":
         prompt = COT_PLUS_PROMPT.format(hint=domain_info[domain]["hint"])
-
-    if use_example:
-        return f'Here is an example of solving the task:\n\n{domain_info[domain]["hint"]}\n\nThis is the end of the example. The real task is below.\n\n---\n\n{prompt}'
+    else:
+        raise ValueError(
+            "Unknown prompt technique. Valid prompt techniques are regular (default), cot, and cot+."
+        )
+
+    if example_count > 1:
+        raise ValueError("More than one example is not currently supported.")
+    if example_count == 1:
+        return f'Here is an example of solving the task:\n\n{domain_info[domain]["example"]}\n\nThis is the end of the example. The real task is below.\n\n---\n\n{prompt}'
     else:
         return prompt
 
@@ -316,7 +327,7 @@ def musr(
         "murder_mysteries", "object_placements", "team_allocation"
     ] = "murder_mysteries",
     prompt_technique: Literal["regular", "cot", "cot+"] = "regular",
-    use_example: bool = False,
+    example_count: int = 0,
 ) -> Task:
     """Inspect task implementing the MuSR benchmark.
 
@@ -325,9 +336,9 @@ def musr(
             Defaults to "murder_mysteries".
         prompt_technique (Literal["regular", "cot", "cot+"]): The prompt technique to use. "regular" includes only the narrative
             and the question. "cot" uses chain-of-thought prompting. "cot+" includes a hint. Defaults to "regular".
-        use_example (bool): Include a solved example at the beginning of each prompt.
+        example_count (int): Number of solved examples to include at the beginning of each prompt. Defaults to 0. Only 0 or 1 is currently supported; values greater than 1 raise a ValueError.
     """
-    prompt = get_domain_prompt(domain, prompt_technique, use_example)
+    prompt = get_domain_prompt(domain, prompt_technique, example_count)
 
     dataset = hf_dataset(
         path="TAUR-Lab/MuSR",