Commit 2d7dcc4

Cleanup and slight refactor
1 parent 2a52d1e commit 2d7dcc4

1 file changed: +36 -25 lines changed

src/inspect_evals/musr/musr.py

Lines changed: 36 additions & 25 deletions
@@ -4,7 +4,7 @@
 https://arxiv.org/abs/2310.16049

 # Example: Eval MuSR with team_allocation and CoT+
-inspect eval musr.py -T domain=team_allocation prompt_technique=cot+
+inspect eval musr.py -T domain=team_allocation -T prompt_technique=cot+

 """

@@ -58,8 +58,8 @@
 Who is the most likely murderer?

 Pick one of the following choices:
-1 - Eugene
-2 - Gabrielle
+A - Eugene
+B - Gabrielle

 You must pick one option. Before selecting a choice, explain your reasoning step by step. The murderer needs to have a means (access to weapon), motive (reason to kill the victim), and opportunity (access to crime scene) in order to have killed the victim. Innocent suspects may have two of these proven, but not all three. An innocent suspect may be suspicious for some other reason, but they will not have all of motive, means, and opportunity established.

@@ -90,7 +90,7 @@

 Therefore, Gabrielle is the most likely murderer.

-ANSWER: 2
+ANSWER: B


 """.strip()
@@ -122,10 +122,10 @@
 Which location is the most likely place Clara would look to find the glass jar given the story?

 Pick one of the following choices:
-1 - dining table
-2 - kitchen table
-3 - pantry
-4 - under counter
+A - dining table
+B - kitchen table
+C - pantry
+D - under counter

 You must pick one option. Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"

@@ -139,7 +139,7 @@

 Now, given the story and evidence, we can assume that Clara believes the glass jar to be in the pantry.

-ANSWER: 3
+ANSWER: C

 """

@@ -159,9 +159,9 @@
 Given the story, how would you uniquely allocate each person to make sure both tasks are accomplished efficiently?

 Pick one of the following choices:
-1 - Teaching: Travis, Maintenance: Angela and Greg
-2 - Teaching: Greg, Maintenance: Angela and Travis
-3 - Teaching: Angela, Maintenance: Greg and Travis
+A - Teaching: Travis, Maintenance: Angela and Greg
+B - Teaching: Greg, Maintenance: Angela and Travis
+C - Teaching: Angela, Maintenance: Greg and Travis

 You must pick one option. The story should allow you to determine how good each person is at a skill. Roughly, each person is either great, acceptable, or bad at a task. We want to find an optimal assignment of people to tasks that uses their skills as well as possible. In addition, one task will have to have two people assigned to it. The effectiveness of their teamwork (great team, acceptable team, or bad team) also impacts the overall quality of the assignment.

@@ -201,13 +201,13 @@

 Now, let's find the best assignment.

-Option 1: Travis as a teacher (1) + Angela working in maintenance (1) + Greg working in maintenance (2) + Angela and Greg work badly together (1) = 5
-Option 2: Greg as a teacher (1) + Angela working in maintenance (1) + Travis working in maintenance (1) + Angela and Travis work badly toghether (1) = 4
-Option 3: Angela as a teacher (1) + Greg working in maintenance (2) + Travis working in maintenance (1) + Greg and Travis work well together (3) = 7
+Option A: Travis as a teacher (1) + Angela working in maintenance (1) + Greg working in maintenance (2) + Angela and Greg work badly together (1) = 5
+Option B: Greg as a teacher (1) + Angela working in maintenance (1) + Travis working in maintenance (1) + Angela and Travis work badly toghether (1) = 4
+Option C: Angela as a teacher (1) + Greg working in maintenance (2) + Travis working in maintenance (1) + Greg and Travis work well together (3) = 7

-So, from this, we can see Option 3 has the maximum score.
+So, from this, we can see Option C has the maximum score.

-ANSWER: 3
+ANSWER: C

 """
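
The hunks above switch the worked examples from numbered choices (`1`, `2`, `3`) to lettered ones (`A`, `B`, `C`). As a rough illustration of that convention, here is a minimal sketch that labels choices with letters and reads the final `ANSWER:` line back into a zero-based index. The helper names `format_choices` and `parse_answer` are hypothetical and do not appear in `musr.py`; the commit does not show how answers are actually scored.

```python
import re
from string import ascii_uppercase
from typing import List, Optional


def format_choices(choices: List[str]) -> str:
    """Render choices as 'A - ...', 'B - ...', matching the updated examples."""
    return "\n".join(
        f"{ascii_uppercase[i]} - {choice}" for i, choice in enumerate(choices)
    )


def parse_answer(completion: str, n_choices: int) -> Optional[int]:
    """Return the zero-based index of the last 'ANSWER: <letter>' in the text.

    Returns None when no answer line is found or the letter is out of range.
    """
    letters = re.findall(r"ANSWER:\s*([A-Za-z])\b", completion)
    if not letters:
        return None
    index = ord(letters[-1].upper()) - ord("A")
    return index if 0 <= index < n_choices else None


if __name__ == "__main__":
    print(format_choices(["Eugene", "Gabrielle"]))
    print(parse_answer("Therefore, Gabrielle is the most likely murderer.\n\nANSWER: B", 2))
```

Taking the last match rather than the first is a design choice of this sketch: a chain-of-thought completion may mention an answer format earlier in its reasoning before committing to a final line.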

@@ -272,7 +272,7 @@
 def get_domain_prompt(
     domain: Literal["murder_mysteries", "object_placements", "team_allocation"],
     prompt_technique: Literal["regular", "cot", "cot+"],
-    use_example: bool,
+    example_count: int,
 ) -> str:
     domain_info = {
         "murder_mysteries": {
@@ -289,15 +289,26 @@ def get_domain_prompt(
         },
     }

+    if domain not in domain_info:
+        raise ValueError(
+            "Unknown domain. Valid domains are murder_mysteries (default), object_placements, and team_allocation"
+        )
+
     if prompt_technique == "regular":
         prompt = REGULAR_PROMPT
     elif prompt_technique == "cot":
         prompt = COT_PROMPT
-    else:
+    elif prompt_technique == "cot+":
         prompt = COT_PLUS_PROMPT.format(hint=domain_info[domain]["hint"])
-
-    if use_example:
-        return f'Here is an example of solving the task:\n\n{domain_info[domain]["hint"]}\n\nThis is the end of the example. The real task is below.\n\n---\n\n{prompt}'
+    else:
+        raise ValueError(
+            "Unknown prompt technique. Valid prompt techniques are regular (default), cot, and cot+."
+        )
+
+    if example_count > 1:
+        raise ValueError(">1 examples currently not supported")
+    if example_count == 1:
+        return f'Here is an example of solving the task:\n\n{domain_info[domain]["example"]}\n\nThis is the end of the example. The real task is below.\n\n---\n\n{prompt}'
     else:
         return prompt
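
The validation order introduced in this hunk can be tried in isolation. The following is a minimal sketch, not the code from `musr.py`: the three prompt templates and the contents of `DOMAIN_INFO` are placeholder assumptions (the real ones are not shown in this diff), and `build_prompt` is a hypothetical stand-in for `get_domain_prompt`.

```python
# Placeholder templates (assumptions): the real REGULAR_PROMPT, COT_PROMPT and
# COT_PLUS_PROMPT are defined in musr.py and are not part of this diff.
REGULAR_PROMPT = "Answer the question."
COT_PROMPT = "Think step by step, then answer the question."
COT_PLUS_PROMPT = "Think step by step, then answer the question. Hint: {hint}"

# Placeholder per-domain data (assumption): each domain has a hint and one
# solved example, mirroring the "hint" and "example" keys used in the diff.
DOMAIN_INFO = {
    name: {"hint": f"hint for {name}", "example": f"solved example for {name}"}
    for name in ("murder_mysteries", "object_placements", "team_allocation")
}


def build_prompt(domain: str, prompt_technique: str, example_count: int) -> str:
    """Hypothetical stand-in for get_domain_prompt, following the new checks."""
    # 1. Reject unknown domains before any lookup into DOMAIN_INFO.
    if domain not in DOMAIN_INFO:
        raise ValueError(f"Unknown domain: {domain!r}")

    # 2. Match each technique explicitly; anything else is an error.
    if prompt_technique == "regular":
        prompt = REGULAR_PROMPT
    elif prompt_technique == "cot":
        prompt = COT_PROMPT
    elif prompt_technique == "cot+":
        prompt = COT_PLUS_PROMPT.format(hint=DOMAIN_INFO[domain]["hint"])
    else:
        raise ValueError(f"Unknown prompt technique: {prompt_technique!r}")

    # 3. At most one solved example can be prepended.
    if example_count > 1:
        raise ValueError(">1 examples currently not supported")
    if example_count == 1:
        example = DOMAIN_INFO[domain]["example"]
        return (
            "Here is an example of solving the task:\n\n"
            f"{example}\n\n"
            "This is the end of the example. The real task is below.\n\n"
            f"---\n\n{prompt}"
        )
    return prompt


if __name__ == "__main__":
    print(build_prompt("team_allocation", "cot+", 1))
```

Two behavioural changes are visible in the diff itself: the old bare `else:` treated any unrecognised technique as `cot+`, whereas the new code raises `ValueError`, and the example branch now reads the `"example"` key where the old code interpolated `"hint"`.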

@@ -316,7 +327,7 @@ def musr(
         "murder_mysteries", "object_placements", "team_allocation"
     ] = "murder_mysteries",
     prompt_technique: Literal["regular", "cot", "cot+"] = "regular",
-    use_example: bool = False,
+    example_count: int = 0,
 ) -> Task:
     """Inspect task implementing the MuSR benchmark.

@@ -325,9 +336,9 @@ def musr(
             Defaults to "murder_mysteries".
         prompt_technique (Literal["regular", "cot", "cot+"]): The prompt technique to use. "regular" includes only the narrative
             and the question. "cot" uses chain-of-thought prompting. "cot+" includes a hint. Defaults to "regular".
-        use_example (bool): Include a solved example at the beginning of each prompt.
+        example_count (int): Number of solved examples to include at the beginning of each prompt. Defaults to 0. Currently only supports 1 example.
     """
-    prompt = get_domain_prompt(domain, prompt_technique, use_example)
+    prompt = get_domain_prompt(domain, prompt_technique, example_count)

     dataset = hf_dataset(
         path="TAUR-Lab/MuSR",
