- Refactored the metric computation in `_utils.py`:
  - Split the `ws_metric` function into three separate functions for better modularity:
    - `preprocess_scores`: Preprocesses scores into a DataFrame with computed weights and biases.
    - `compute_accuracy`: Computes weighted accuracy from preprocessed scores.
    - `compute_bias`: Computes weighted bias from preprocessed scores.
  - Updated `ws_accuracy` and `ws_bias` in `worldsense.py` to utilize the new functions.
- Improved documentation:
  - Added detailed explanations of problem types and grades in `README.md`, clarifying how `problemname` is formed.
  - Included a comprehensive docstring for the `worldsense` task in `worldsense.py`, explaining the task's purpose and usage.
src/inspect_evals/worldsense/README.md (+15 -4)
@@ -4,20 +4,31 @@
 
 
 ## Dataset
-Here is an example prompt from the dataset:
+Here is an example prompt ("description") from the dataset (`problemname == "Compl.trivial"`):
 
 >"Alice is enrolled in 3 courses per week: Alice takes history before computer science and economics before history.
 >Choose one of the following alternatives: (1) Alice takes history in between economics and computer science, (2) Alice takes history outside of the time range between economics and computer science, or (3) it is impossible to decide.
 >Think carefully, and only respond with one of these possible options (1), (2), or (3).
 
-The model is then tasked to pick the correct choice.
+There are three problem types:
+
+1. "Infer" (inference): Determine whether a given statement about a description is true or false.
+2. "Compl" (completion): Select which of three statements are true about a description, including an option for when it is not possible to decide.
+3. "Consist" (consistency): Determine whether a statement about a description is possible or impossible.
+
+In addition there are two grades:
+
+1. "trivial": can be solved on statements alone
+2. "normal": requires a world model to solve
+
+A problemname is formed by concatenation "<type>.<grade>".
 
 ## Scoring
 
 - Simple accuracy
 - Standard error
-- Weighted accuracy by `tuple_ID`, `problemname`, and `problemsize` (reported in the paper)
-- Weighted bias by `tuple_ID`, `problemname`, and `problemsize` (reported in the paper)
+- Weighted accuracy by `tuple_ID`, `problemname`, and `problemsize` (as reported in the paper)
+- Weighted bias by `tuple_ID`, `problemname`, and `problemsize` (as reported in the paper)
 
In addition to built-in metrics, the main results are weighted accuracy and bias. Here the primary unit is a `tuple_ID` which corresponds to one "description" or scenario above. To this, up to three answer option sets are provided to ensure that all possible correct answers for the options are selected exactly once. Weights are used to average over multiple `tuple_ID`s.
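The weighting described in that README paragraph, together with the three-function split listed in the PR description, can be pictured with a small standard-library sketch. It is not the code in `_utils.py`: the record fields (`value`, `bias`, `weight`), the plain-dict layout instead of a DataFrame, and the unweighted mean across groups are all assumptions made for illustration, and the real aggregation may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record layout (not taken from the PR): one dict per scored
# answer, with the grouping keys, a 0/1 correctness value, a signed bias
# indicator, and a weight.
GROUP_KEYS = ("tuple_ID", "problemname", "problemsize")


def preprocess_scores(records):
    """Collapse records into one weight-normalized row per group."""
    sums = defaultdict(lambda: {"value": 0.0, "bias": 0.0, "weight": 0.0})
    for rec in records:
        key = tuple(rec[k] for k in GROUP_KEYS)
        row = sums[key]
        row["value"] += rec["value"] * rec["weight"]
        row["bias"] += rec["bias"] * rec["weight"]
        row["weight"] += rec["weight"]
    # Groups with zero total weight are skipped to avoid dividing by zero.
    return {
        key: {
            "value": row["value"] / row["weight"],
            "bias": row["bias"] / row["weight"],
        }
        for key, row in sums.items()
        if row["weight"] > 0
    }


def compute_accuracy(grouped):
    """Mean of per-group accuracies (assumed aggregation across groups)."""
    return mean(row["value"] for row in grouped.values())


def compute_bias(grouped):
    """Mean of per-group biases (assumed aggregation across groups)."""
    return mean(row["bias"] for row in grouped.values())


# Made-up example data: two descriptions, each with several answer options.
records = [
    {"tuple_ID": "t1", "problemname": "Compl.trivial", "problemsize": 3,
     "value": 1.0, "bias": 1.0, "weight": 0.25},
    {"tuple_ID": "t1", "problemname": "Compl.trivial", "problemsize": 3,
     "value": 0.0, "bias": 1.0, "weight": 0.25},
    {"tuple_ID": "t1", "problemname": "Compl.trivial", "problemsize": 3,
     "value": 1.0, "bias": -1.0, "weight": 0.5},
    {"tuple_ID": "t2", "problemname": "Infer.normal", "problemsize": 3,
     "value": 1.0, "bias": 1.0, "weight": 0.5},
    {"tuple_ID": "t2", "problemname": "Infer.normal", "problemsize": 3,
     "value": 0.0, "bias": 1.0, "weight": 0.5},
]

grouped = preprocess_scores(records)
print(compute_accuracy(grouped), compute_bias(grouped))
```

Keeping preprocessing separate means the grouping runs through one code path, and the accuracy and bias metrics only differ in which column they average, which appears to be the motivation for the split.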
+Task for evaluating reasoning related to world descriptions. There are three problem types ("Infer", "Compl", "Consist") and two grades ("trivial", "normal"). A problemname is formed by concatenation "<type>.<grade>". See README for details.
+
+Args:
+    problemnames (str | list[str], optional): A string or list of strings specifying the names of problems or tasks to filter the dataset. If provided, it filters the dataset to samples that contain matching metadata for the specified problem names.
+
+Returns:
+    Task: A task object configured with a dataset filtered by problem names (if specified), a solver, a scoring pattern for evaluating task responses, and custom metrics.
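To make the `problemnames` argument concrete, here is a minimal sketch of how a `"<type>.<grade>"` name could be built and used to filter samples. The helpers `make_problemname` and `filter_by_problemnames` are hypothetical names, and representing samples as plain dicts with a `metadata` entry is an assumption; the actual task filters its dataset through its own code.

```python
# Problem types and grades as listed in the README changes.
PROBLEM_TYPES = ("Infer", "Compl", "Consist")
GRADES = ("trivial", "normal")


def make_problemname(problem_type, grade):
    """Build a problemname of the form "<type>.<grade>" (hypothetical helper)."""
    if problem_type not in PROBLEM_TYPES:
        raise ValueError(f"unknown problem type: {problem_type!r}")
    if grade not in GRADES:
        raise ValueError(f"unknown grade: {grade!r}")
    return f"{problem_type}.{grade}"


def filter_by_problemnames(samples, problemnames=None):
    """Keep samples whose metadata problemname matches (hypothetical helper).

    Accepts a single string or a list of strings, mirroring the
    `str | list[str]` type in the docstring; None disables filtering.
    """
    if problemnames is None:
        return list(samples)
    if isinstance(problemnames, str):
        problemnames = [problemnames]
    wanted = set(problemnames)
    return [s for s in samples if s["metadata"]["problemname"] in wanted]


# Made-up samples; only the metadata layout matters for this sketch.
samples = [
    {"id": 1, "metadata": {"problemname": "Compl.trivial"}},
    {"id": 2, "metadata": {"problemname": "Infer.normal"}},
    {"id": 3, "metadata": {"problemname": "Consist.normal"}},
]

selected = filter_by_problemnames(samples, make_problemname("Compl", "trivial"))
print([s["id"] for s in selected])
```

Normalizing a single string to a one-element list up front keeps the membership test identical for both accepted input types.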