tools/listing.yaml
3 additions & 3 deletions
@@ -58,7 +58,7 @@
 - title: "Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models"
   description: |
-    40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
+    40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
 - title: "GSM8K: Training Verifiers to Solve Math Word Problems"
   description: |
-    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demostrates fewshot prompting.
+    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demonstrates fewshot prompting.
   path: src/inspect_evals/gsm8k
   arxiv: https://arxiv.org/abs/2110.14168
   group: Mathematics
@@ -183,7 +183,7 @@
 - title: "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens"
   description: |
-    LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
+    LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.