`README.md` (5 additions, 4 deletions)
````diff
@@ -25,11 +25,12 @@ You will also need to install any packages required to interact with the models
 export OPENAI_API_KEY=<openai-api-key>
 pip install openai
 ```
-Furthermore, some of the evaluations require additional dependencies. If your eval needs extra dependency, instructions for installing them are provided the [list of evals](#list-of-evals). subsection (or the README for that evaluation). For example, to install the dependencies of `SWE-Bench` evaluation you should run:
+
+Furthermore, some of the evaluations require additional dependencies. If your eval needs extra dependency, instructions for installing them are provided the [list of evals](#list-of-evals) subsection (or the README for that evaluation). For example, to install the dependencies of `SWE-Bench` evaluation you should run:
````
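The install command this sentence leads into falls outside the lines shown in this hunk. As a rough sketch of the usual pattern for optional extras (the `swe_bench` extra name is an assumption inferred from the eval's name, not confirmed by this diff):

```bash
# Hypothetical sketch: install the optional dependencies for the SWE-Bench eval.
# The "swe_bench" extra name is an assumption; check that eval's README for the
# exact command.
pip install "inspect_evals[swe_bench]"
```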
````diff
@@ -41,8 +42,8 @@
-If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working direcotry that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
+If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
 
 ```bash
 INSPECT_EVAL_MODEL=openai/gpt-4o
 OPENAI_API_KEY=<openai-api-key>
 ```
 
-Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral, AzureAI, AWS Bedrock, TogetherAI, Groq, HuggingFace, vLLM, Ollama, and more. See the [Model Providers](https://inspect.ai-safety-institute.org.uk/models.html) documentation for additional details.
+Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, Ollama, and more. See the [Model Providers](https://inspect.ai-safety-institute.org.uk/models.html) documentation for additional details.
````
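With the `.env` file in place, an evaluation can then be launched without repeating `--model`. A minimal sketch, assuming the `inspect_evals/gsm8k` task name suggested by the listing below:

```bash
# INSPECT_EVAL_MODEL and OPENAI_API_KEY are picked up from .env in the
# working directory, so no --model flag is needed.
inspect eval inspect_evals/gsm8k
```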
`tools/listing.yaml` (3 additions, 4 deletions)
````diff
@@ -58,7 +58,7 @@
 
 - title: "Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models"
   description: |
-    40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
+    40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
@@ -119,6 +119,6 @@
 - title: "GSM8K: Training Verifiers to Solve Math Word Problems"
   description: |
-    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demostrates fewshot prompting.
+    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demonstrates fewshot prompting.
   path: src/inspect_evals/gsm8k
   arxiv: https://arxiv.org/abs/2110.14168
   group: Mathematics
@@ -180,10 +180,9 @@
   contributors: ["seddy-aisi"]
   tasks: ["piqa"]
 
-
 - title: "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens"
   description: |
-    LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
+    LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
````
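Pieced together from the fields visible across these hunks, a complete `listing.yaml` entry appears to take the following shape (a sketch: the `contributors` and `tasks` values for GSM8K are illustrative assumptions modeled on the PIQA entry above):

```yaml
# Sketch of a full listing.yaml entry, assembled from fields seen in this diff.
- title: "GSM8K: Training Verifiers to Solve Math Word Problems"
  description: |
    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demonstrates fewshot prompting.
  path: src/inspect_evals/gsm8k
  arxiv: https://arxiv.org/abs/2110.14168
  group: Mathematics
  contributors: ["<github-handle>"] # assumption: format mirrors the PIQA entry
  tasks: ["gsm8k"] # assumption: task id inferred from the path
```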