Commit ca01674

Merge pull request rllm-org#121 from bact/add-sh-annotation-markdown
Fix typos and Markdown warnings in README
2 parents 63b9137 + 7a614ab commit ca01674

File tree: README.md, tools/listing.py, tools/listing.yaml

3 files changed: 14 additions & 10 deletions


README.md

Lines changed: 5 additions & 4 deletions
@@ -25,11 +25,12 @@ You will also need to install any packages required to interact with the models
 export OPENAI_API_KEY=<openai-api-key>
 pip install openai
 ```
-Furthermore, some of the evaluations require additional dependencies. If your eval needs extra dependency, instructions for installing them are provided the [list of evals](#list-of-evals). subsection (or the README for that evaluation). For example, to install the dependencies of `SWE-Bench` evaluation you should run:
+
+Furthermore, some of the evaluations require additional dependencies. If your eval needs extra dependency, instructions for installing them are provided the [list of evals](#list-of-evals) subsection (or the README for that evaluation). For example, to install the dependencies of `SWE-Bench` evaluation you should run:

 ```bash
 pip install "inspect_evals[swe_bench] @ git+https://github.com/UKGovernmentBEIS/inspect_evals"
-pip install -e ".[swe_bench]" # If developing on the pacakge locally
+pip install -e ".[swe_bench]" # If developing on the package locally
 ```

 Once you have a model configured, you can run evaluations for it with:
@@ -38,14 +39,14 @@ Once you have a model configured, you can run evaluations for it with:
 inspect eval inspect_evals/gpqa_diamond --model openai/gpt-4o
 ```

-If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working direcotry that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
+If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

 ```bash
 INSPECT_EVAL_MODEL=openai/gpt-4o
 OPENAI_API_KEY=<openai-api-key>
 ```

-Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral, AzureAI, AWS Bedrock, TogetherAI, Groq, HuggingFace, vLLM, Ollama, and more. See the [Model Providers](https://inspect.ai-safety-institute.org.uk/models.html) documentation for additional details.
+Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral, Azure AI, AWS Bedrock, Together AI, Groq, Hugging Face, vLLM, Ollama, and more. See the [Model Providers](https://inspect.ai-safety-institute.org.uk/models.html) documentation for additional details.

 # List of Evals
 <!-- Eval Listing: Automatically Generated -->

tools/listing.py

Lines changed: 6 additions & 2 deletions
@@ -30,13 +30,14 @@ def listing_md(listing: dict[str, Any]) -> str:
     output: list[str] = []
     output.append(f"- ### {link_md(listing['title'], os.path.join(listing['path']))}")
     output.append(f" {listing['description']}{contributors}")
-    output.append(" ```")
+    output.append("")
+    output.append(" ```bash")
     for index, task in enumerate(listing["tasks"]):
         if index > 3:
             break
         output.append(f" inspect eval inspect_evals/{task}")

-    output.append(" ```\n")
+    output.append(" ```")
     return "\n".join(output)


@@ -98,6 +99,7 @@ def generate_options(task_metadata: dict[str, Any]) -> None:
     contents.append(
         "You can control a variety of options from the command line. For example:"
     )
+    contents.append("")
     contents.append("```bash")
     contents.append(f"inspect eval inspect_evals/{task_names[0]} --limit 10")
     contents.append(f"inspect eval inspect_evals/{task_names[1]} --max-connections 10")
@@ -118,6 +120,7 @@ def generate_usage(task_metadata: dict[str, Any]) -> None:
     contents.append(
         "First, install the `inspect_ai` and `inspect_evals` Python packages with:"
     )
+    contents.append("")
     contents.append("```bash")
     contents.append("pip install inspect_ai")
     if dependency is None:
@@ -132,6 +135,7 @@ def generate_usage(task_metadata: dict[str, Any]) -> None:
     contents.append("")

     contents.append("Then, evaluate against one or more models with:")
+    contents.append("")
     contents.append("```bash")
     for index, task in enumerate(task_metadata["tasks"]):
         if index > 3:
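
Taken together, the tools/listing.py changes follow one pattern: emit a blank line before each generated code fence and annotate the fence as bash, which is what clears the Markdown warnings the commit message mentions. Below is a minimal, self-contained sketch of the resulting output for one entry; it is illustrative only, not the repo's exact code, and the GSM8K title, path, and task name are borrowed from the tools/listing.yaml hunks in this same commit.

# Illustrative sketch (not the repo's exact code) of how a listing entry is
# rendered after this change: a blank line before the fence plus a "bash"
# language annotation, the two details the Markdown warnings are about.
tasks = ["gsm8k"]  # example task name, taken from tools/listing.yaml

lines: list[str] = []
lines.append("- ### [GSM8K: Training Verifiers to Solve Math Word Problems](src/inspect_evals/gsm8k)")
lines.append(" Dataset of 8.5K high quality linguistically diverse grade school math word problems.")
lines.append("")          # blank line before the fence
lines.append(" ```bash")  # language-annotated fence
for index, task in enumerate(tasks):
    if index > 3:  # cap the listing at four example commands, as listing_md does
        break
    lines.append(f" inspect eval inspect_evals/{task}")
lines.append(" ```")

print("\n".join(lines))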

tools/listing.yaml

Lines changed: 3 additions & 4 deletions
@@ -58,7 +58,7 @@

 - title: "Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models"
   description: |
-    40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
+    40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
   path: src/inspect_evals/cybench
   group: Cybersecurity
   contributors: ["sinman-aisi", "sam-deverett-dsit", "kola-aisi", "pgiav"]
@@ -118,7 +118,7 @@

 - title: "GSM8K: Training Verifiers to Solve Math Word Problems"
   description: |
-    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demostrates fewshot prompting.
+    Dataset of 8.5K high quality linguistically diverse grade school math word problems. Demonstrates fewshot prompting.
   path: src/inspect_evals/gsm8k
   arxiv: https://arxiv.org/abs/2110.14168
   group: Mathematics
@@ -180,10 +180,9 @@
   contributors: ["seddy-aisi"]
   tasks: ["piqa"]

-
 - title: "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens"
   description: |
-    LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
+    LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
   path: src/inspect_evals/infinite_bench
   arxiv: https://arxiv.org/abs/2402.13718
   group: Reasoning
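
These YAML entries supply the data that tools/listing.py renders: the field names (title, description, path, arxiv, group, contributors, tasks) line up with the `listing` dict that `listing_md` consumes. Below is a minimal sketch of loading the file, assuming PyYAML is available and that tools/listing.yaml is a top-level list of such entries, as the hunks suggest; it is not code from the repo.

from pathlib import Path
from typing import Any

import yaml  # PyYAML -- an assumed dependency, not shown in this diff

# Load the eval listing; each entry is a dict whose fields match what
# listing_md() in tools/listing.py expects (title, description, path, tasks, ...).
listings: list[dict[str, Any]] = yaml.safe_load(
    Path("tools/listing.yaml").read_text(encoding="utf-8")
)

for listing in listings:
    print(f"{listing['title']} -> {listing['path']}")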
