Commit 8647315

initial SecQA v1 benchmark implementation
1 parent ca01674 commit 8647315

File tree

- README.md
- src/inspect_evals/_registry.py
- src/inspect_evals/sec_qa/__init__.py
- src/inspect_evals/sec_qa/sec_qa.py

4 files changed (+57, -0 lines)


README.md

Lines changed: 7 additions & 0 deletions
@@ -129,6 +129,13 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
 inspect eval inspect_evals/gdm_in_house_ctf
 ```

+- ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
+LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA provides “v1” and “v2” datasets of multiple-choice questions offering two levels of cybersecurity evaluation criteria. The questions were generated by GPT-4 from the textbook “Computer Systems Security: Planning for Success” and vetted by humans.
+<sub><sup>Contributed by: [@matthewreed26](https://github.com/matthewreed26)</sup></sub>
+```
+inspect eval inspect_evals/sec_qa_v1
+inspect eval inspect_evals/sec_qa_v2
+```

 ## Safeguards
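As a usage sketch, the registered tasks can also be invoked from Python via inspect_ai's eval() function rather than the CLI (the model string below is only illustrative):

```python
from inspect_ai import eval

# run the SecQA v1 task by its registry name; the model name here is an example
eval("inspect_evals/sec_qa_v1", model="openai/gpt-4o")
```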

src/inspect_evals/_registry.py

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@
 from .piqa import piqa
 from .pubmedqa import pubmedqa
 from .race_h import race_h
+from .sec_qa import sec_qa_v1, sec_qa_v2
 from .squad import squad
 from .swe_bench import swe_bench
 from .truthfulqa import truthfulqa
src/inspect_evals/sec_qa/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+from .sec_qa import sec_qa_v1, sec_qa_v2
+
+__all__ = ["sec_qa_v1", "sec_qa_v2"]

src/inspect_evals/sec_qa/sec_qa.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+"""
+SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security
+
+Zefang Liu
+https://arxiv.org/pdf/2312.15838
+"""
+
+from typing import Any
+
+from inspect_ai import Task, task
+from inspect_ai.dataset import Sample, hf_dataset
+from inspect_ai.scorer import choice
+from inspect_ai.solver import multiple_choice, system_message
+
+# setup for problem + instructions for providing answer
+SYSTEM_MESSAGE = """
+The following are multiple choice questions about Computer Security.
+""".strip()
+
+
+@task
+def sec_qa_v1() -> Task:
+    """Inspect Task implementing the SecQA benchmark v1"""
+    # dataset: the v1 subset of zefang-liu/secqa on Hugging Face (trust=True allows its loading script to run)
+    dataset = hf_dataset(
+        path="zefang-liu/secqa",
+        name="secqa_v1",  # aka subset
+        split="dev",
+        sample_fields=record_to_sample,
+        trust=True,
+    )
+
+    # define task: system prompt + multiple-choice solver, scored against the target letter
+    return Task(
+        dataset=dataset,
+        solver=[system_message(SYSTEM_MESSAGE), multiple_choice()],
+        scorer=choice(),
+    )
+
+
+def record_to_sample(record: dict[str, Any]) -> Sample:
+    return Sample(
+        input=record["Question"],
+        choices=[record["A"], record["B"], record["C"], record["D"]],
+        target=record["Answer"],
+    )
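Note that _registry.py and the package __init__.py both import sec_qa_v2, while this file defines only sec_qa_v1, so importing the package as committed would fail. A minimal sketch of the missing v2 task, assuming the secqa_v2 subset shares the v1 column schema and "dev" split:

```python
@task
def sec_qa_v2() -> Task:
    """Inspect Task implementing the SecQA benchmark v2"""
    # assumption: the v2 subset uses the same columns and split naming as v1
    dataset = hf_dataset(
        path="zefang-liu/secqa",
        name="secqa_v2",
        split="dev",
        sample_fields=record_to_sample,
        trust=True,
    )
    return Task(
        dataset=dataset,
        solver=[system_message(SYSTEM_MESSAGE), multiple_choice()],
        scorer=choice(),
    )
```

For context on the record shape, record_to_sample expects rows with Question, A through D, and Answer fields; the field values below are invented for illustration:

```python
record = {
    "Question": "Which control most directly limits the impact of a compromised account?",
    "A": "Password reuse",
    "B": "Least-privilege access",
    "C": "Disabling audit logs",
    "D": "Shared administrator credentials",
    "Answer": "B",
}
sample = record_to_sample(record)  # Sample with the question as input and target "B"
```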
