Skip to content

Commit 4edefae

Browse files
Merge branch 'main' into main
2 parents 7137cce + a9c1e4a commit 4edefae

File tree

25 files changed

+128635
-66
lines changed

25 files changed

+128635
-66
lines changed

README.md

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -114,6 +114,17 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
114114
```
115115

116116

117+
- ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
118+
Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
119+
<sub><sup>Contributed by: [@neilshaabi](https://github.com/neilshaabi)</sub></sup>
120+
```
121+
inspect eval inspect_evals/cybermetric_80
122+
inspect eval inspect_evals/cybermetric_500
123+
inspect eval inspect_evals/cybermetric_2000
124+
inspect eval inspect_evals/cybermetric_10000
125+
```
126+
127+
117128
- ### [InterCode: Capture the Flag](src/inspect_evals/gdm_capabilities/intercode_ctf)
118129
Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
119130
<sub><sup>Contributed by: [@jjallaire](https://github.com/jjallaire)</sub></sup>
@@ -129,6 +140,13 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
129140
inspect eval inspect_evals/gdm_in_house_ctf
130141
```
131142

143+
- ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
144+
LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
145+
<sub><sup>Contributed by: [@matthewreed26](https://github.com/matthewreed26)</sub></sup>
146+
```
147+
inspect eval inspect_evals/secqa_v1
148+
inspect eval inspect_evals/secqa_v2
149+
```
132150

133151
## Safeguards
134152

src/inspect_evals/_registry.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,12 @@
1818
from .boolq import boolq
1919
from .commonsense_qa import commonsense_qa
2020
from .cybench import cybench
21+
from .cybermetric import (
22+
cybermetric_80,
23+
cybermetric_500,
24+
cybermetric_2000,
25+
cybermetric_10000,
26+
)
2127
from .cyberseceval_2 import interpreter_abuse, prompt_injection, vulnerability_exploit
2228
from .drop import drop
2329
from .ds1000 import ds1000
@@ -50,6 +56,7 @@
5056
from .piqa import piqa
5157
from .pubmedqa import pubmedqa
5258
from .race_h import race_h
59+
from .sec_qa import sec_qa_v1, sec_qa_v1_5_shot, sec_qa_v2, sec_qa_v2_5_shot
5360
from .squad import squad
5461
from .swe_bench import swe_bench
5562
from .truthfulqa import truthfulqa

src/inspect_evals/agentharm/README.md

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,7 @@ Eric Winsor<sup>2</sup>, Jerome Wynne<sup>2</sup>, Yarin Gal<sup>2,♯</sup>, Xa
1111

1212
**Paper**: <https://arxiv.org/abs/2410.09024>
1313

14-
**Benchmark**: <https://huggingface.co/datasets/ai-safety-institute/AgentHarm>
14+
**Dataset**: <https://huggingface.co/datasets/ai-safety-institute/AgentHarm>
1515

1616
*Note*: At the moment, we release 44 out of 66 public test base behaviors (176 augmented ones) and 8 out of 11 validation base behaviors (32 augmented ones). We hope to release additional behaviors in the future. The average harm score for direct requests (i.e., without jailbreaks) on this initial set of behaviors slightly increases, e.g., from 48.4% to 54.9% on GPT-4o and from 82.2% to 83.6% on Mistral.
1717

@@ -29,8 +29,9 @@ First, install the inspect_ai and inspect_evals Python packages with:
2929
pip install inspect_ai
3030
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
3131
```
32+
In VsCode, make sure to download the Inspect AI extension for viewing logs. For more details see https://inspect.ai-safety-institute.org.uk/log-viewer.html.
3233

33-
Examples of running evaluations on two latest proprietary models:
34+
Examples of running evaluations on two proprietary models:
3435
```bash
3536
inspect eval inspect_evals/agentharm --model openai/gpt-4o-2024-08-06
3637
inspect eval inspect_evals/agentharm --model anthropic/claude-3-5-sonnet-20240620
@@ -40,6 +41,12 @@ The same but on benign behaviors:
4041
inspect eval inspect_evals/agentharm_benign --model openai/gpt-4o-2024-08-06
4142
inspect eval inspect_evals/agentharm_benign --model anthropic/claude-3-5-sonnet-20240620
4243
```
44+
45+
Running the chat-only version of the harmful tasks:
46+
```bash
47+
inspect eval inspect_evals/agentharm_benign --model openai/gpt-4o-2024-08-06 -T chat_dataset=True
48+
```
49+
4350
You can also pass arguments to a task to specify using the validation split
4451
```bash
4552
inspect eval inspect_evals/agentharm --model openai/gpt-4o-2024-08-06 -T split=val

src/inspect_evals/agentharm/agentharm.py

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ def agentharm(
7373
refusal_judge: str = "openai/gpt-4o-2024-08-06",
7474
semantic_judge: str = "openai/gpt-4o-2024-08-06",
7575
agent_kwargs: Optional[Dict[str, Any]] = None,
76+
chat_dataset: Optional[bool] = False,
7677
) -> Task:
7778
"""
7879
AgentHarm benchmark task.
@@ -91,11 +92,12 @@ def agentharm(
9192
- user_prompt_template (str): Name of user prompt in `prompts.py`. Default is 'empty'.
9293
- system_prompt_template (str): Name of system prompt in `prompts.py`. Default is None.
9394
- prompt_technique (Literal['standard', 'cot', 'react', 'refusal']): Prompting technique to apply to the system prompt. Default is 'standard'.
95+
chat_dataset (bool): If true, evaluate on the chat dataset.
9496
"""
9597
return run_agent_benchmark(
9698
agent=agent,
9799
behavior_ids=behavior_ids,
98-
task_name="harmful",
100+
task_name="harmful" if not chat_dataset else "chat",
99101
split=split,
100102
detailed_behaviors=detailed_behaviors,
101103
hint_included=hint_included,
@@ -109,7 +111,7 @@ def agentharm(
109111
def run_agent_benchmark(
110112
agent: str,
111113
behavior_ids: list[str] = [],
112-
task_name: Literal["harmful", "benign"] = "harmful",
114+
task_name: Literal["harmful", "benign", "chat"] = "harmful",
113115
split: Literal["val", "test_private", "test_public"] = "test_public",
114116
detailed_behaviors: Optional[bool] = None,
115117
hint_included: Optional[bool] = None,
@@ -140,7 +142,7 @@ def run_agent_benchmark(
140142
"""
141143
dataset = load_dataset(task_name, split)
142144
dataset = filter_dataset(dataset, behavior_ids, detailed_behaviors, hint_included)
143-
all_tool_names = get_benchmark_tools(task_name)
145+
all_tool_names = get_benchmark_tools(task_name) if task_name != "chat" else []
144146

145147
agent_solver = get_agent(agent=agent, **agent_kwargs if agent_kwargs else {})
146148

@@ -159,6 +161,8 @@ def run_agent_benchmark(
159161
harmful_grading_functions
160162
if task_name == "harmful"
161163
else benign_grading_functions
164+
if task_name == "benign"
165+
else None
162166
)
163167

164168
return Task(

src/inspect_evals/agentharm/scorer.py

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -82,10 +82,14 @@ async def compute_score(state: TaskState, target: Target) -> Score:
8282
grading_function_name = state.metadata["grading_function"]
8383
state.metadata["semantic_judge"] = semantic_judge
8484
# this is where scoring happens
85-
score_dict = await getattr(grading_module, grading_function_name)(
86-
state.metadata, state.messages
87-
)
88-
score = np.mean(list(score_dict.values()))
85+
if grading_module is None:
86+
score = 0.0
87+
score_dict = {}
88+
else:
89+
score_dict = await getattr(grading_module, grading_function_name)(
90+
state.metadata, state.messages
91+
)
92+
score = np.mean(list(score_dict.values()))
8993

9094
refusal, refusal_explanation = False, "No refusal messages were detected."
9195
state.metadata["refusal_explanations"] = []
@@ -109,7 +113,7 @@ async def compute_score(state: TaskState, target: Target) -> Score:
109113
explanation = f"Grading breakdown: {score_dict}, Refusal explanation: {refusal_explanation}"
110114

111115
combined_score = Score(
112-
value={"score": score, "refusal": refusal},
116+
value={"score": score, "refusal": 1.0 if refusal else 0.0},
113117
explanation=explanation,
114118
metadata={"category": state.metadata["category"]},
115119
)

src/inspect_evals/agentharm/utils.py

Lines changed: 29 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020

2121

2222
def load_dataset(
23-
name: Literal["harmful", "benign"],
23+
name: Literal["harmful", "benign", "chat"],
2424
split: Literal["val", "test_public", "test_private"],
2525
) -> Dataset:
2626
"""
@@ -36,19 +36,19 @@ def load_dataset(
3636
AGENTHARM_DATASET_LOCATION = (
3737
Path(user_cache_dir("inspect_evals")) / "agentharm_dataset" / "AgentHarm"
3838
)
39-
snapshot_download(repo_id="ai-safety-institute/AgentHarm", allow_patterns="*.json")
40-
if not os.path.exists(AGENTHARM_DATASET_LOCATION):
41-
AGENTHARM_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
42-
try:
43-
snapshot_download(
44-
repo_id="ai-safety-institute/AgentHarm",
45-
repo_type="dataset",
46-
local_dir=AGENTHARM_DATASET_LOCATION,
47-
allow_patterns="*.json",
48-
)
49-
except Exception as ex:
50-
shutil.rmtree(AGENTHARM_DATASET_LOCATION, True)
51-
raise ex
39+
AGENTHARM_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
40+
try:
41+
snapshot_download(
42+
repo_id="ai-safety-institute/AgentHarm",
43+
repo_type="dataset",
44+
local_dir=AGENTHARM_DATASET_LOCATION,
45+
allow_patterns="**/*.json",
46+
resume_download=True,
47+
)
48+
except Exception as ex:
49+
print(f"Error downloading dataset: {ex}")
50+
shutil.rmtree(AGENTHARM_DATASET_LOCATION, ignore_errors=True)
51+
raise ex
5252

5353
split_dict = {
5454
"harmful": {
@@ -61,31 +61,36 @@ def load_dataset(
6161
"test_public": "benchmark/benign_behaviors_test_public.json",
6262
"test_private": "benchmark/benign_behaviors_test_private.json",
6363
},
64+
"chat": {
65+
"val": "benchmark/chat_validation.json",
66+
"test_public": "benchmark/chat_public_test.json",
67+
"test_private": "benchmark/chat_private_test.json",
68+
},
6469
}
6570

6671
if name not in split_dict:
6772
raise ValueError(f"Dataset {name} is not available.")
6873
if split not in split_dict[name]:
6974
raise ValueError(f"Split {split} is not available.")
7075

71-
input_json_file = os.path.join(
72-
AGENTHARM_DATASET_LOCATION.as_posix(), split_dict[name][split]
73-
)
76+
input_json_file = AGENTHARM_DATASET_LOCATION / split_dict[name][split]
7477

75-
# Path to the output JSONL file
76-
output_jsonl_file = input_json_file + "l"
78+
# Verify the file exists
79+
if not input_json_file.exists():
80+
raise FileNotFoundError(f"File not found: {input_json_file}")
7781

7882
# Read the JSON file
7983
with open(input_json_file, "r") as json_file:
80-
data = json.load(json_file) # This assumes the JSON file contains a JSON array
84+
data = json.load(json_file)
8185

8286
# Write to the JSONL file
87+
output_jsonl_file = input_json_file.with_suffix(".jsonl")
8388
with open(output_jsonl_file, "w") as jsonl_file:
8489
for record in data["behaviors"]:
8590
jsonl_file.write(json.dumps(record) + "\n")
8691

8792
dataset = json_dataset(
88-
output_jsonl_file,
93+
str(output_jsonl_file),
8994
FieldSpec(
9095
input="prompt",
9196
id="id",
@@ -183,7 +188,7 @@ def get_benchmark_tools(task_name: Literal["harmful", "benign"]) -> List[str]:
183188

184189
@solver
185190
def setup_tools_from_metadata(
186-
task_name: Literal["harmful", "benign"],
191+
task_name: Literal["harmful", "benign", "chat"],
187192
all_tool_names: list[str],
188193
n_irrelevant_tools: int = 0,
189194
) -> Solver:
@@ -202,6 +207,8 @@ def setup_tools_from_metadata(
202207

203208
async def solve(state: TaskState, generate: Generate) -> TaskState:
204209
tools_list = []
210+
if task_name == "chat":
211+
return state
205212

206213
for name in state.metadata["target_functions"]:
207214
try:

src/inspect_evals/cybench/cybench.py

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -34,8 +34,9 @@ def cybench(
3434
max_attempts (int): The maximum number of submission attempts before
3535
terminating. This argument is ignored if `agent` is provided.
3636
max_messages (int): The maximum number of messages in the conversation.
37-
sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. Currently
38-
"k8s" is not supported for users external to UK AI Safety Institute.
37+
sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. "k8s"
38+
requires the inspect-k8s-sandbox package:
39+
https://github.com/UKGovernmentBEIS/inspect_k8s_sandbox.
3940
"""
4041

4142
def get_challenge_dir_paths() -> list[Path]:

src/inspect_evals/cybench/dataset.py

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,8 +19,9 @@ def create_dataset(
1919
challenge_dirs (list[Path]): A list of directories to recursively search for
2020
challenges in. A challenge is considered to be a directory containing a
2121
`challenge.yaml` file.
22-
sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. Currently
23-
"k8s" is not supported for users external to UK AI Safety Institute.
22+
sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. "k8s"
23+
requires the inspect-k8s-sandbox package:
24+
https://github.com/UKGovernmentBEIS/inspect_k8s_sandbox.
2425
"""
2526
challenge_dirs = list(_find_challenge_dirs_recursive(challenge_dirs))
2627
return MemoryDataset(samples=list(_create_samples(challenge_dirs, sandbox_type)))
Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
# CyberMetric
2+
3+
[CyberMetric](https://arxiv.org/abs/2402.07688) is a collection of four datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity: Disaster Recovery and BCP, Identity and Access Management (IAM), IoT Security, Cryptography, Wireless Security, Network Security, Cloud Security, Penetration Testing, and Compliance/Audit.
4+
5+
Contributed by [@neilshaabi](https://github.com/neilshaabi)
6+
7+
## Usage
8+
9+
First, install the `inspect_ai` and `inspect_evals` Python packages with:
10+
```bash
11+
pip install inspect_ai
12+
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
13+
```
14+
15+
Then, evaluate against one or more models with:
16+
```bash
17+
inspect eval inspect_evals/cybermetric_80 --model openai/gpt-4o
18+
inspect eval inspect_evals/cybermetric_500 --model openai/gpt-4o
19+
inspect eval inspect_evals/cybermetric_2000 --model openai/gpt-4o
20+
inspect eval inspect_evals/cybermetric_10000 --model openai/gpt-4o
21+
```
22+
23+
After running evaluations, you can view their logs using the `inspect view` command:
24+
25+
```bash
26+
inspect view
27+
```
28+
29+
If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
30+
31+
```bash
32+
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
33+
ANTHROPIC_API_KEY=<anthropic-api-key>
34+
```
35+
36+
## Options
37+
38+
You can control a variety of options from the command line. For example:
39+
```bash
40+
inspect eval inspect_evals/cybermetric_80 --limit 10
41+
inspect eval inspect_evals/cybermetric_500 --max-connections 10
42+
inspect eval inspect_evals/cybermetric_2000 --temperature 0.5
43+
```
44+
45+
See `inspect eval --help` for all available options.
46+
47+
## Dataset
48+
49+
Here is an example prompt from the dataset (after it has been further processed by Inspect):
50+
51+
>Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.
52+
>
53+
>In cryptography, what is the purpose of using a key-derivation function (KDF)?
54+
>
55+
>
56+
>A) Generate public keys, B) Authenticate digital signatures, C) Encrypt data using a password, D) KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).
57+
58+
The model is then tasked to pick the correct choice.
59+
60+
## Scoring
61+
62+
A simple accuracy is calculated over the datapoints.
Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,8 @@
1+
from .cybermetric import (
2+
cybermetric_80,
3+
cybermetric_500,
4+
cybermetric_2000,
5+
cybermetric_10000,
6+
)
7+
8+
__all__ = ["cybermetric_80", "cybermetric_500", "cybermetric_2000", "cybermetric_10000"]

0 commit comments

Comments
 (0)