
Commit 4646526

Merge pull request rllm-org#155 from MattFisher/cleanup/secqa
Follow-up: SecQA
2 parents 299f69c + 67834f2 · commit 4646526

File tree

3 files changed: +128 -22 lines changed

README.md
src/inspect_evals/sec_qa/README.md
tools/listing.yaml

README.md

Lines changed: 17 additions & 12 deletions
@@ -113,7 +113,6 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/cybench
   ```

-
 - ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
   Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
   <sub><sup>Contributed by: [@neilshaabi](https://github.com/neilshaabi)</sub></sup>
@@ -132,30 +131,36 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/gdm_intercode_ctf
   ```

-
 - ### [GDM Dangerous Capabilities: Capture the Flag](src/inspect_evals/gdm_capabilities/in_house_ctf)
   CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.
   <sub><sup>Contributed by: [@XkunW](https://github.com/XkunW)</sub></sup>
-  ```
+
+  ```bash
   inspect eval inspect_evals/gdm_in_house_ctf
   ```
-
-- ### [SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence](src/inspect_evals/sevenllm)
-  [SEvenLLM](https://arxiv.org/abs/2405.03446) is a benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
+
+- ### [SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.](src/inspect_evals/sevenllm)
+  Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
   <sub><sup>Contributed by: [@kingroryg](https://github.com/kingroryg)</sub></sup>
-  ```
+
+  ```bash
   inspect eval inspect_evals/sevenllm_mcq_zh
   inspect eval inspect_evals/sevenllm_mcq_en
   inspect eval inspect_evals/sevenllm_qa_zh
   inspect eval inspect_evals/sevenllm_qa_en
   ```

 - ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
-  LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
+  "Security Question Answering" dataset to assess LLMs' understanding and application of security principles.
+  SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria.
+  The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
   <sub><sup>Contributed by: [@matthewreed26](https://github.com/matthewreed26)</sub></sup>
-  ```
-  inspect eval inspect_evals/secqa_v1
-  inspect eval inspect_evals/secqa_v2
+
+  ```bash
+  inspect eval inspect_evals/sec_qa_v1
+  inspect eval inspect_evals/sec_qa_v1_5_shot
+  inspect eval inspect_evals/sec_qa_v2
+  inspect eval inspect_evals/sec_qa_v2_5_shot
   ```

 ## Safeguards
@@ -402,4 +407,4 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   ```


-<!-- /Eval Listing: Automatically Generated -->
+<!-- /Eval Listing: Automatically Generated -->

src/inspect_evals/sec_qa/README.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+# SecQA
+
+[SecQA](https://arxiv.org/abs/2312.15838) (Security Question Answering) is a
+benchmark for evaluating the performance of Large Language Models (LLMs) in the
+domain of computer security. Utilizing multiple-choice questions generated by
+GPT-4 based on the "Computer Systems Security: Planning for Success" textbook,
+SecQA aims to assess LLMs' understanding and application of security principles.
+
+SecQA is organized into two versions: v1 and v2. Both are implemented here in
+both zero-shot and 5-shot settings. Version 1 is designed to assess foundational
+understanding, presenting questions that cover basic concepts and widely
+recognized security principles. It serves as a preliminary test to gauge LLMs’
+basic comprehension and application of security knowledge. Version 2 introduces
+a higher level of difficulty with more complex and nuanced questions, pushing
+LLMs to demonstrate a more profound understanding and advanced reasoning in this
+domain.
+
+<!-- Contributors: Automatically Generated -->
+Contributed by [@matthewreed26](https://github.com/matthewreed26)
+<!-- /Contributors: Automatically Generated -->
+
+<!-- Usage: Automatically Generated -->
+## Usage
+
+First, install the `inspect_ai` and `inspect_evals` Python packages with:
+
+```bash
+pip install inspect_ai
+pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
+```
+
+Then, evaluate against one or more models with:
+
+```bash
+inspect eval inspect_evals/sec_qa_v1 --model openai/gpt-4o
+inspect eval inspect_evals/sec_qa_v1_5_shot --model openai/gpt-4o
+inspect eval inspect_evals/sec_qa_v2 --model openai/gpt-4o
+inspect eval inspect_evals/sec_qa_v2_5_shot --model openai/gpt-4o
+```
+
+After running evaluations, you can view their logs using the `inspect view` command:
+
+```bash
+inspect view
+```
+
+If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
+
+```bash
+INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
+ANTHROPIC_API_KEY=<anthropic-api-key>
+```
+<!-- /Usage: Automatically Generated -->
+
+<!-- Options: Automatically Generated -->
+## Options
+
+You can control a variety of options from the command line. For example:
+
+```bash
+inspect eval inspect_evals/sec_qa_v1 --limit 10
+inspect eval inspect_evals/sec_qa_v1_5_shot --max-connections 10
+inspect eval inspect_evals/sec_qa_v2 --temperature 0.5
+```
+
+See `inspect eval --help` for all available options.
+<!-- /Options: Automatically Generated -->
+
+## Dataset
+
+Here is an example from the dataset:
+
+> Question: What is the purpose of implementing a Guest Wireless Network in a
+> corporate environment?
+>
+> A) To provide unrestricted access to company resources
+>
+> B) To offer a separate, secure network for visitors
+>
+> C) To replace the primary corporate wireless network
+>
+> D) To bypass network security protocols
+>
+> Correct Answer: B
+
+The model is tasked to choose one of the four options.
+
+## Scoring
+
+A simple accuracy is calculated over the datapoints.
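
For context, the four `sec_qa_*` tasks registered above follow Inspect's standard multiple-choice pattern: a dataset of lettered options, the `multiple_choice()` solver, and the `choice()` scorer, whose default metric is the simple accuracy mentioned in the Scoring section. The sketch below is illustrative only and is not the code added by this commit (that lives in `src/inspect_evals/sec_qa`); the Hugging Face dataset id, config name, split, and record field names are assumptions.

```python
# Illustrative sketch of an inspect_ai multiple-choice task in the style of SecQA.
# The dataset id, config name, split, and field names below are assumptions and
# may differ from the actual implementation in src/inspect_evals/sec_qa.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record: dict) -> Sample:
    # Map one dataset record to an Inspect Sample: the question text, the four
    # lettered options, and the correct letter as the target.
    return Sample(
        input=record["Question"],
        choices=[record["A"], record["B"], record["C"], record["D"]],
        target=record["Answer"],
    )


@task
def sec_qa_v1_sketch() -> Task:
    # Zero-shot variant: multiple_choice() formats the options for the model,
    # and choice() scores the selected letter, reporting accuracy by default.
    return Task(
        dataset=hf_dataset(
            path="zefang-liu/secqa",  # assumed Hugging Face dataset id
            name="secqa_v1",          # assumed config name for the v1 questions
            split="test",             # assumed split name
            sample_fields=record_to_sample,
        ),
        solver=multiple_choice(),
        scorer=choice(),
    )
```

If saved locally as `sec_qa_sketch.py`, such a task could be run with `inspect eval sec_qa_sketch.py --model openai/gpt-4o`.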

tools/listing.yaml

Lines changed: 21 additions & 10 deletions
@@ -66,15 +66,6 @@
   tasks: ["cybench"]
   tags: ["Agent"]

-- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
-  description: |
-    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
-  path: src/inspect_evals/sevenllm
-  group: Cybersecurity
-  contributors: ["kingroryg"]
-  arxiv: https://arxiv.org/abs/2405.03446
-  tasks: ["sevenllm_mcq_zh", "sevenllm_mcq_en", "sevenllm_qa_zh", "sevenllm_qa_en"]
-
 - title: "CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge"
   description: |
     Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
@@ -123,6 +114,26 @@
   tasks: ["gdm_in_house_ctf"]
   tags: ["Agent"]

+- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
+  description: |
+    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
+  path: src/inspect_evals/sevenllm
+  group: Cybersecurity
+  contributors: ["kingroryg"]
+  arxiv: https://arxiv.org/abs/2405.03446
+  tasks: ["sevenllm_mcq_zh", "sevenllm_mcq_en", "sevenllm_qa_zh", "sevenllm_qa_en"]
+
+- title: "SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security"
+  description: |
+    "Security Question Answering" dataset to assess LLMs' understanding and application of security principles.
+    SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria.
+    The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
+  path: src/inspect_evals/sec_qa
+  group: Cybersecurity
+  contributors: ["matthewreed26"]
+  arxiv: https://arxiv.org/abs/2312.15838
+  tasks: ["sec_qa_v1", "sec_qa_v1_5_shot", "sec_qa_v2", "sec_qa_v2_5_shot"]
+
 - title: "AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents"
   description: |
     Diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment.
@@ -400,4 +411,4 @@
     "agie_sat_en",
     "agie_sat_en_without_passage",
     "agie_sat_math",
-  ]
+  ]

0 commit comments