
Commit f8ed793

Merge branch 'main' into main
2 parents: 28fde8b + a9c1e4a

File tree

16 files changed: +128555 −4 lines

README.md

Lines changed: 18 additions & 0 deletions
````diff
@@ -114,6 +114,17 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
 ```
 
 
+- ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
+  Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
+  <sub><sup>Contributed by: [@neilshaabi](https://github.com/neilshaabi)</sub></sup>
+  ```
+  inspect eval inspect_evals/cybermetric_80
+  inspect eval inspect_evals/cybermetric_500
+  inspect eval inspect_evals/cybermetric_2000
+  inspect eval inspect_evals/cybermetric_10000
+  ```
+
+
 - ### [InterCode: Capture the Flag](src/inspect_evals/gdm_capabilities/intercode_ctf)
   Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
   <sub><sup>Contributed by: [@jjallaire](https://github.com/jjallaire)</sub></sup>
@@ -139,6 +150,13 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
 inspect eval inspect_evals/sevenllm_qa_en
 ```
 
+- ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
+  LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
+  <sub><sup>Contributed by: [@matthewreed26](https://github.com/matthewreed26)</sub></sup>
+  ```
+  inspect eval inspect_evals/secqa_v1
+  inspect eval inspect_evals/secqa_v2
+  ```
 
 ## Safeguards
 
````

src/inspect_evals/_registry.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -18,6 +18,12 @@
 from .boolq import boolq
 from .commonsense_qa import commonsense_qa
 from .cybench import cybench
+from .cybermetric import (
+    cybermetric_80,
+    cybermetric_500,
+    cybermetric_2000,
+    cybermetric_10000,
+)
 from .drop import drop
 from .ds1000 import ds1000
 from .gaia import gaia, gaia_level1, gaia_level2, gaia_level3
@@ -49,6 +55,7 @@
 from .piqa import piqa
 from .pubmedqa import pubmedqa
 from .race_h import race_h
+from .sec_qa import sec_qa_v1, sec_qa_v1_5_shot, sec_qa_v2, sec_qa_v2_5_shot
 from .sevenllm import sevenllm_mcq_en, sevenllm_mcq_zh, sevenllm_qa_en, sevenllm_qa_zh
 from .squad import squad
 from .swe_bench import swe_bench
```
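The additions to `_registry.py` are what expose the new tasks under the `inspect_evals/` prefix used by the CLI. As a rough sketch of what this enables (hypothetical usage; assumes the package is installed as described in the README below):

```python
# Sketch: once registered, the tasks are ordinary importable constructors.
from inspect_evals.cybermetric import cybermetric_80

task = cybermetric_80()  # the CyberMetric-80 Task defined in cybermetric.py below
```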
src/inspect_evals/cybermetric/README.md

Lines changed: 62 additions & 0 deletions
# CyberMetric

[CyberMetric](https://arxiv.org/abs/2402.07688) is a collection of four datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity: Disaster Recovery and BCP, Identity and Access Management (IAM), IoT Security, Cryptography, Wireless Security, Network Security, Cloud Security, Penetration Testing, and Compliance/Audit.

Contributed by [@neilshaabi](https://github.com/neilshaabi)

## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:
```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

Then, evaluate against one or more models with:
```bash
inspect eval inspect_evals/cybermetric_80 --model openai/gpt-4o
inspect eval inspect_evals/cybermetric_500 --model openai/gpt-4o
inspect eval inspect_evals/cybermetric_2000 --model openai/gpt-4o
inspect eval inspect_evals/cybermetric_10000 --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```

## Options

You can control a variety of options from the command line. For example:
```bash
inspect eval inspect_evals/cybermetric_80 --limit 10
inspect eval inspect_evals/cybermetric_500 --max-connections 10
inspect eval inspect_evals/cybermetric_2000 --temperature 0.5
```

See `inspect eval --help` for all available options.

## Dataset

Here is an example prompt from the dataset (after it has been further processed by Inspect):

>Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.
>
>In cryptography, what is the purpose of using a key-derivation function (KDF)?
>
>A) Generate public keys, B) Authenticate digital signatures, C) Encrypt data using a password, D) KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).

The model is then tasked to pick the correct choice.

## Scoring

A simple accuracy is calculated over the datapoints.
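For reference, each raw record in the JSON files follows the shape consumed by `record_to_sample()` in `cybermetric.py` (shown further below). A hypothetical example, with field names (`question`, `answers`, `solution`) taken from the loader and values reconstructed from the prompt above:

```python
# Hypothetical CyberMetric record; field names match what
# record_to_sample() in cybermetric.py reads.
record = {
    "question": "In cryptography, what is the purpose of using a key-derivation function (KDF)?",
    "answers": {
        "A": "Generate public keys",
        "B": "Authenticate digital signatures",
        "C": "Encrypt data using a password",
        "D": "KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).",
    },
    "solution": "D",
}
```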
src/inspect_evals/cybermetric/__init__.py

Lines changed: 8 additions & 0 deletions
```python
from .cybermetric import (
    cybermetric_80,
    cybermetric_500,
    cybermetric_2000,
    cybermetric_10000,
)

__all__ = ["cybermetric_80", "cybermetric_500", "cybermetric_2000", "cybermetric_10000"]
```
src/inspect_evals/cybermetric/cybermetric.py

Lines changed: 81 additions & 0 deletions
```python
"""
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Tamas Bisztray, Merouane Debbah
https://arxiv.org/abs/2402.07688

# Run all subsets
inspect eval cybermetric.py

# Run specific subsets
inspect eval cybermetric.py@cybermetric_80
inspect eval cybermetric.py@cybermetric_500
inspect eval cybermetric.py@cybermetric_2000
inspect eval cybermetric.py@cybermetric_10000
"""

from pathlib import Path
from typing import Any

from datasets import Dataset  # type: ignore
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, json_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice, system_message

DATASET_DIR = Path(__file__).parent / "datasets"
SYSTEM_MESSAGE = "You are a security expert who answers questions."


@task
def cybermetric_80() -> Task:
    """Evaluates model on CyberMetric-80"""
    return create_task(80)


@task
def cybermetric_500() -> Task:
    """Evaluates model on CyberMetric-500"""
    return create_task(500)


@task
def cybermetric_2000() -> Task:
    """Evaluates model on CyberMetric-2000"""
    return create_task(2000)


@task
def cybermetric_10000() -> Task:
    """Evaluates model on CyberMetric-10000"""
    return create_task(10000)


def create_task(dataset_size: int) -> Task:
    # All four tasks share the same solver and scorer; only the dataset differs
    name = f"CyberMetric-{dataset_size}"
    return Task(
        name=name,
        dataset=load_dataset(name),
        solver=[system_message(SYSTEM_MESSAGE), multiple_choice()],
        scorer=choice(),
    )


def load_dataset(name: str) -> Dataset:
    # Loads e.g. datasets/cybermetric-80.json from the local dataset directory
    dataset = json_dataset(
        json_file=f"{DATASET_DIR}/{name.lower()}.json",
        name=name,
        sample_fields=record_to_sample,
        auto_id=True,
    )
    return dataset


def record_to_sample(record: dict[str, Any]) -> Sample:
    # Flatten the answer options into the question text; the choices passed to
    # the solver are the option letters, and the target is the correct letter
    options = ", ".join([f"{key}) {value}" for key, value in record["answers"].items()])
    input = f"Question: {record['question']}\nOptions: {options}\n\n"
    return Sample(
        input=input, choices=list(record["answers"].keys()), target=record["solution"]
    )
```
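These tasks can also be driven from Python rather than the CLI. A minimal sketch, assuming `inspect_ai` and the evals package are installed and a model API key is configured as in the README above:

```python
# Minimal sketch: run CyberMetric-80 on a handful of samples via the Python API.
from inspect_ai import eval

from inspect_evals.cybermetric import cybermetric_80

# limit=10 keeps the run small; the model name is illustrative.
logs = eval(cybermetric_80(), model="openai/gpt-4o", limit=10)
```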
