README.md (17 additions, 12 deletions)

````diff
@@ -113,7 +113,6 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral…
   inspect eval inspect_evals/cybench
   ```
 
-
 - ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
 
   Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
@@ -132,30 +131,36 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistral…
   inspect eval inspect_evals/gdm_intercode_ctf
   ```
 
-
 - ### [GDM Dangerous Capabilities: Capture the Flag](src/inspect_evals/gdm_capabilities/in_house_ctf)
 
   CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.
 
-- ### [SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence](src/inspect_evals/sevenllm)
-
-  [SEvenLLM](https://arxiv.org/abs/2405.03446) is a benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
+
+- ### [SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.](src/inspect_evals/sevenllm)
+
+  Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
 
 - ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
 
-  LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
+  "Security Question Answering" dataset to assess LLMs' understanding and application of security principles.
+  SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria.
+  The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
````
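Each benchmark section in the README pairs its description with an `inspect eval` command, as the context lines above show for cybench and gdm_intercode_ctf. The commands for the sections edited here are not captured in the diff; the sketch below follows the same pattern with hypothetical task names:

```bash
# Hypothetical invocations following the README's established pattern.
# The actual task names registered for sevenllm and sec_qa are not
# shown in this diff; substitute the names from each benchmark's README.
inspect eval inspect_evals/sevenllm --model openai/gpt-4o
inspect eval inspect_evals/sec_qa --model openai/gpt-4o
```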
````diff
@@ … @@
 After running evaluations, you can view their logs using the `inspect view` command:
+
+```bash
+inspect view
+```
+
 If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
````
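The paragraph above ends with "For example:", but the example file itself is not captured in the diff. A minimal sketch of such a `.env` file, with an illustrative model name and the matching provider key variable:

```bash
# .env, read from the working directory by Inspect.
# The model name and provider key below are illustrative assumptions;
# use whichever provider and key your evaluations require.
INSPECT_EVAL_MODEL=openai/gpt-4o
OPENAI_API_KEY=<your-openai-api-key>
```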
tools/listing.yaml (21 additions, 10 deletions)

````diff
@@ -66,15 +66,6 @@
   tasks: ["cybench"]
   tags: ["Agent"]
 
-- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
-  description: |
-    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
 - title: "CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge"
   description: |
     Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
@@ -123,6 +114,26 @@
   tasks: ["gdm_in_house_ctf"]
   tags: ["Agent"]
 
+- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
+  description: |
+    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
````
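Only the `title` and `description` fields of the moved entry are captured above; the rest of the entry is cut off. For orientation, a hypothetical complete entry in the shape the surrounding context lines (`tasks:`, `tags:`) suggest, where `path` and `arxiv` are taken from the README hunks above and `group`, `contributors`, and `tasks` are placeholders:

```yaml
# Hypothetical reconstruction of the complete entry. title and description
# are from the diff above, path and arxiv from the README hunks, and the
# remaining fields are placeholders not recoverable from this diff.
- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
  description: |
    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
  path: src/inspect_evals/sevenllm
  arxiv: https://arxiv.org/abs/2405.03446
  group: Cybersecurity
  contributors: ["<github-handle>"]
  tasks: ["<task-name>"]
```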