Commit 4445b86

Merge pull request rllm-org#119 from neilshaabi/cybermetric
CyberMetric Benchmark Implementation | ASET - Arcadia Impact
2 parents 7b20369 + 5393419 commit 4445b86

File tree

11 files changed (+128218, −0 lines)

README.md

Lines changed: 11 additions & 0 deletions
@@ -114,6 +114,17 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
 ```


+- ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
+  Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity.
+  <sub><sup>Contributed by: [@neilshaabi](https://github.com/neilshaabi)</sup></sub>
+  ```
+  inspect eval inspect_evals/cybermetric_80
+  inspect eval inspect_evals/cybermetric_500
+  inspect eval inspect_evals/cybermetric_2000
+  inspect eval inspect_evals/cybermetric_10000
+  ```
+
+
 - ### [InterCode: Capture the Flag](src/inspect_evals/gdm_capabilities/intercode_ctf)
   Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
   <sub><sup>Contributed by: [@jjallaire](https://github.com/jjallaire)</sup></sub>
src/inspect_evals/_registry.py

Lines changed: 6 additions & 0 deletions
@@ -18,6 +18,12 @@
 from .boolq import boolq
 from .commonsense_qa import commonsense_qa
 from .cybench import cybench
+from .cybermetric import (
+    cybermetric_80,
+    cybermetric_500,
+    cybermetric_2000,
+    cybermetric_10000,
+)
 from .drop import drop
 from .ds1000 import ds1000
 from .gaia import gaia, gaia_level1, gaia_level2, gaia_level3
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
# CyberMetric

[CyberMetric](https://arxiv.org/abs/2402.07688) is a collection of four datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity: Disaster Recovery and BCP, Identity and Access Management (IAM), IoT Security, Cryptography, Wireless Security, Network Security, Cloud Security, Penetration Testing, and Compliance/Audit.

Contributed by [@neilshaabi](https://github.com/neilshaabi)

## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
```

Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/cybermetric_80 --model openai/gpt-4o
inspect eval inspect_evals/cybermetric_500 --model openai/gpt-4o
inspect eval inspect_evals/cybermetric_2000 --model openai/gpt-4o
inspect eval inspect_evals/cybermetric_10000 --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```

## Options

You can control a variety of options from the command line. For example:

```bash
inspect eval inspect_evals/cybermetric_80 --limit 10
inspect eval inspect_evals/cybermetric_500 --max-connections 10
inspect eval inspect_evals/cybermetric_2000 --temperature 0.5
```

See `inspect eval --help` for all available options.
## Dataset

Here is an example prompt from the dataset (after it has been further processed by Inspect):

> Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.
>
> In cryptography, what is the purpose of using a key-derivation function (KDF)?
>
> A) Generate public keys, B) Authenticate digital signatures, C) Encrypt data using a password, D) KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).

The model is then tasked to pick the correct choice.
## Scoring

A simple accuracy metric is calculated over the datapoints.
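As a minimal sketch of what that computation amounts to (assuming each sample is scored 1 if the chosen letter matches the target and 0 otherwise):

```python
def accuracy(scores: list[int]) -> float:
    # Mean of per-sample correctness values (1 = correct, 0 = incorrect).
    return sum(scores) / len(scores) if scores else 0.0

print(accuracy([1, 0, 1, 1]))  # 0.75
```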
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
from .cybermetric import (
    cybermetric_80,
    cybermetric_500,
    cybermetric_2000,
    cybermetric_10000,
)

__all__ = ["cybermetric_80", "cybermetric_500", "cybermetric_2000", "cybermetric_10000"]
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
"""
CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Tamas Bisztray, Merouane Debbah
https://arxiv.org/abs/2402.07688

# Run all subsets
inspect eval cybermetric.py

# Run specific subsets
inspect eval cybermetric.py@cybermetric_80
inspect eval cybermetric.py@cybermetric_500
inspect eval cybermetric.py@cybermetric_2000
inspect eval cybermetric.py@cybermetric_10000
"""

from pathlib import Path
from typing import Any

from datasets import Dataset  # type: ignore

from inspect_ai import Task, task
from inspect_ai.dataset import Sample, json_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice, system_message

DATASET_DIR = Path(__file__).parent / "datasets"
SYSTEM_MESSAGE = "You are a security expert who answers questions."


@task
def cybermetric_80() -> Task:
    """Evaluates model on CyberMetric-80."""
    return create_task(80)


@task
def cybermetric_500() -> Task:
    """Evaluates model on CyberMetric-500."""
    return create_task(500)


@task
def cybermetric_2000() -> Task:
    """Evaluates model on CyberMetric-2000."""
    return create_task(2000)


@task
def cybermetric_10000() -> Task:
    """Evaluates model on CyberMetric-10000."""
    return create_task(10000)


def create_task(dataset_size: int) -> Task:
    """Build the Task for the CyberMetric subset of the given size."""
    name = f"CyberMetric-{dataset_size}"
    return Task(
        name=name,
        dataset=load_dataset(name),
        solver=[system_message(SYSTEM_MESSAGE), multiple_choice()],
        scorer=choice(),
    )


def load_dataset(name: str) -> Dataset:
    """Load the JSON file for the named subset, converting records to Samples."""
    dataset = json_dataset(
        json_file=f"{DATASET_DIR}/{name.lower()}.json",
        name=name,
        sample_fields=record_to_sample,
        auto_id=True,
    )
    return dataset


def record_to_sample(record: dict[str, Any]) -> Sample:
    """Convert a raw dataset record into an Inspect Sample."""
    # Render the answer options as "A) ..., B) ..." within the prompt text.
    options = ", ".join([f"{key}) {value}" for key, value in record["answers"].items()])
    input = f"Question: {record['question']}\nOptions: {options}\n\n"
    return Sample(
        input=input, choices=list(record["answers"].keys()), target=record["solution"]
    )
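For reference, here is a standalone sketch mirroring the prompt formatting in `record_to_sample` above. The `format_input` helper and the example record are hypothetical, and the `Sample` construction is omitted so the snippet runs without `inspect_ai` installed:

```python
from typing import Any

def format_input(record: dict[str, Any]) -> str:
    # Mirrors record_to_sample's formatting: "A) ..., B) ..." option list.
    options = ", ".join(f"{key}) {value}" for key, value in record["answers"].items())
    return f"Question: {record['question']}\nOptions: {options}\n\n"

# Hypothetical record in the dataset's shape (question / answers / solution).
record = {
    "question": "What does a KDF produce?",
    "answers": {"A": "Public keys", "B": "Keys and IVs"},
    "solution": "B",
}
print(format_input(record))
```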
