README.md (18 additions, 0 deletions)
@@ -114,6 +114,17 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
 ```

+- ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
+Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
 - ### [InterCode: Capture the Flag](src/inspect_evals/gdm_capabilities/intercode_ctf)
 Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
@@ -129,6 +140,13 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
 inspect eval inspect_evals/gdm_in_house_ctf
 ```

+- ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
+LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
 *Note*: At the moment, we release 44 out of 66 public test base behaviors (176 augmented ones) and 8 out of 11 validation base behaviors (32 augmented ones). We hope to release additional behaviors in the future. The average harm score for direct requests (i.e., without jailbreaks) on this initial set of behaviors slightly increases, e.g., from 48.4% to 54.9% on GPT-4o and from 82.2% to 83.6% on Mistral.
@@ -29,8 +29,9 @@ First, install the inspect_ai and inspect_evals Python packages with:
 In VsCode, make sure to download the Inspect AI extension for viewing logs. For more details see https://inspect.ai-safety-institute.org.uk/log-viewer.html.

-Examples of running evaluations on two latest proprietary models:
+Examples of running evaluations on two proprietary models:
+[CyberMetric](https://arxiv.org/abs/2402.07688) is a collection of four datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity: Disaster Recovery and BCP, Identity and Access Management (IAM), IoT Security, Cryptography, Wireless Security, Network Security, Cloud Security, Penetration Testing, and Compliance/Audit.
+
+Contributed by [@neilshaabi](https://github.com/neilshaabi)
+
+## Usage
+
+First, install the `inspect_ai` and `inspect_evals` Python packages with:
+After running evaluations, you can view their logs using the `inspect view` command:
+
+```bash
+inspect view
+```
+
+If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
+See `inspect eval --help` for all available options.
+
+## Dataset
+
+Here is an example prompt from the dataset (after it has been further processed by Inspect):
+
+>Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.
+>
+>In cryptography, what is the purpose of using a key-derivation function (KDF)?
+>
+>
+>A) Generate public keys, B) Authenticate digital signatures, C) Encrypt data using a password, D) KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).
+
+The model is then tasked to pick the correct choice.
+
+## Scoring
+
+A simple accuracy is calculated over the datapoints.
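
The Dataset and Scoring sections added above describe a response format (`ANSWER: $LETTER`) and an accuracy metric. A minimal sketch of how those two pieces fit together might look like the following. It is not Inspect's own implementation: `ANSWER_RE`, `extract_answer`, `accuracy`, and the sample responses and targets are hypothetical names and made-up data introduced only for illustration.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern for the 'ANSWER: $LETTER' format shown in the example prompt.
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])\b")


def extract_answer(response: str) -> Optional[str]:
    """Return the chosen letter (A-D), or None if no answer can be parsed."""
    match = ANSWER_RE.search(response)
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the target letter."""
    if len(responses) != len(targets):
        raise ValueError("responses and targets must have the same length")
    if not targets:
        return 0.0
    # An unparsable response yields None, which never equals a target letter.
    correct = sum(extract_answer(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)


# Made-up responses and targets, for illustration only.
responses = ["ANSWER: D", "ANSWER: B", "I am not sure."]
targets = ["D", "A", "C"]
print(accuracy(responses, targets))
```

In this sketch a response that does not follow the requested format is simply counted as incorrect, which is one reasonable reading of "simple accuracy"; the actual scorer may handle such cases differently.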