
Commit 4646526

Merge pull request rllm-org#155 from MattFisher/cleanup/secqa
Follow-up: SecQA
2 parents 299f69c + 67834f2 · commit 4646526

File tree

3 files changed: +128 -22 lines changed

README.md
src/inspect_evals/sec_qa/README.md
tools/listing.yaml

README.md

Lines changed: 17 additions & 12 deletions
@@ -113,7 +113,6 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/cybench
   ```

-
 - ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
   Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
   <sub><sup>Contributed by: [@neilshaabi](https://github.com/neilshaabi)</sub></sup>
@@ -132,30 +131,36 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   inspect eval inspect_evals/gdm_intercode_ctf
   ```

-
 - ### [GDM Dangerous Capabilities: Capture the Flag](src/inspect_evals/gdm_capabilities/in_house_ctf)
   CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.
   <sub><sup>Contributed by: [@XkunW](https://github.com/XkunW)</sub></sup>
-  ```
+
+  ```bash
   inspect eval inspect_evals/gdm_in_house_ctf
   ```
-
-- ### [SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence](src/inspect_evals/sevenllm)
-  [SEvenLLM](https://arxiv.org/abs/2405.03446) is a benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
+
+- ### [SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.](src/inspect_evals/sevenllm)
+  Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
   <sub><sup>Contributed by: [@kingroryg](https://github.com/kingroryg)</sub></sup>
-  ```
+
+  ```bash
   inspect eval inspect_evals/sevenllm_mcq_zh
   inspect eval inspect_evals/sevenllm_mcq_en
   inspect eval inspect_evals/sevenllm_qa_zh
   inspect eval inspect_evals/sevenllm_qa_en
   ```

 - ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
-  LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
+  "Security Question Answering" dataset to assess LLMs' understanding and application of security principles.
+  SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria.
+  The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
   <sub><sup>Contributed by: [@matthewreed26](https://github.com/matthewreed26)</sub></sup>
-  ```
-  inspect eval inspect_evals/secqa_v1
-  inspect eval inspect_evals/secqa_v2
+
+  ```bash
+  inspect eval inspect_evals/sec_qa_v1
+  inspect eval inspect_evals/sec_qa_v1_5_shot
+  inspect eval inspect_evals/sec_qa_v2
+  inspect eval inspect_evals/sec_qa_v2_5_shot
   ```

 ## Safeguards
@@ -402,4 +407,4 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
   ```


-<!-- /Eval Listing: Automatically Generated -->
+<!-- /Eval Listing: Automatically Generated -->

src/inspect_evals/sec_qa/README.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+# SecQA
+
+[SecQA](https://arxiv.org/abs/2312.15838) (Security Question Answering) is a
+benchmark for evaluating the performance of Large Language Models (LLMs) in the
+domain of computer security. Utilizing multiple-choice questions generated by
+GPT-4 based on the "Computer Systems Security: Planning for Success" textbook,
+SecQA aims to assess LLMs' understanding and application of security principles.
+
+SecQA is organized into two versions: v1 and v2. Both are implemented here in
+both zero-shot and 5-shot settings. Version 1 is designed to assess foundational
+understanding, presenting questions that cover basic concepts and widely
+recognized security principles. It serves as a preliminary test to gauge LLMs’
+basic comprehension and application of security knowledge. Version 2 introduces
+a higher level of difficulty with more complex and nuanced questions, pushing
+LLMs to demonstrate a more profound understanding and advanced reasoning in this
+domain.
+
+<!-- Contributors: Automatically Generated -->
+Contributed by [@matthewreed26](https://github.com/matthewreed26)
+<!-- /Contributors: Automatically Generated -->
+
+<!-- Usage: Automatically Generated -->
+## Usage
+
+First, install the `inspect_ai` and `inspect_evals` Python packages with:
+
+```bash
+pip install inspect_ai
+pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
+```
+
+Then, evaluate against one or more models with:
+
+```bash
+inspect eval inspect_evals/sec_qa_v1 --model openai/gpt-4o
+inspect eval inspect_evals/sec_qa_v1_5_shot --model openai/gpt-4o
+inspect eval inspect_evals/sec_qa_v2 --model openai/gpt-4o
+inspect eval inspect_evals/sec_qa_v2_5_shot --model openai/gpt-4o
+```
+
+After running evaluations, you can view their logs using the `inspect view` command:
+
+```bash
+inspect view
+```
+
+If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
+
+```bash
+INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
+ANTHROPIC_API_KEY=<anthropic-api-key>
+```
+<!-- /Usage: Automatically Generated -->
+
+<!-- Options: Automatically Generated -->
+## Options
+
+You can control a variety of options from the command line. For example:
+
+```bash
+inspect eval inspect_evals/sec_qa_v1 --limit 10
+inspect eval inspect_evals/sec_qa_v1_5_shot --max-connections 10
+inspect eval inspect_evals/sec_qa_v2 --temperature 0.5
+```
+
+See `inspect eval --help` for all available options.
+<!-- /Options: Automatically Generated -->
+
+## Dataset
+
+Here is an example from the dataset:
+
+> Question: What is the purpose of implementing a Guest Wireless Network in a
+> corporate environment?
+>
+> A) To provide unrestricted access to company resources
+>
+> B) To offer a separate, secure network for visitors
+>
+> C) To replace the primary corporate wireless network
+>
+> D) To bypass network security protocols
+>
+> Correct Answer: B
+
+The model is tasked to choose one of the four options.
+
+## Scoring
+
+A simple accuracy is calculated over the datapoints.
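
For context, the four `sec_qa_*` tasks registered above follow Inspect's standard multiple-choice pattern: a dataset of lettered options, the `multiple_choice()` solver, and the `choice()` scorer, whose default metric is the simple accuracy mentioned in the Scoring section. The sketch below is illustrative only and is not the code added by this commit (that lives in `src/inspect_evals/sec_qa`); the Hugging Face dataset id, config name, split, and record field names are assumptions.

```python
# Illustrative sketch of an inspect_ai multiple-choice task in the style of SecQA.
# The dataset id, config name, split, and field names below are assumptions and
# may differ from the actual implementation in src/inspect_evals/sec_qa.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample, hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


def record_to_sample(record: dict) -> Sample:
    # Map one dataset record to an Inspect Sample: the question text, the four
    # lettered options, and the correct letter as the target.
    return Sample(
        input=record["Question"],
        choices=[record["A"], record["B"], record["C"], record["D"]],
        target=record["Answer"],
    )


@task
def sec_qa_v1_sketch() -> Task:
    # Zero-shot variant: multiple_choice() formats the options for the model,
    # and choice() scores the selected letter, reporting accuracy by default.
    return Task(
        dataset=hf_dataset(
            path="zefang-liu/secqa",  # assumed Hugging Face dataset id
            name="secqa_v1",          # assumed config name for the v1 questions
            split="test",             # assumed split name
            sample_fields=record_to_sample,
        ),
        solver=multiple_choice(),
        scorer=choice(),
    )
```

If saved locally as `sec_qa_sketch.py`, such a task could be run with `inspect eval sec_qa_sketch.py --model openai/gpt-4o`.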

tools/listing.yaml

Lines changed: 21 additions & 10 deletions
@@ -66,15 +66,6 @@
   tasks: ["cybench"]
   tags: ["Agent"]

-- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
-  description: |
-    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
-  path: src/inspect_evals/sevenllm
-  group: Cybersecurity
-  contributors: ["kingroryg"]
-  arxiv: https://arxiv.org/abs/2405.03446
-  tasks: ["sevenllm_mcq_zh", "sevenllm_mcq_en", "sevenllm_qa_zh", "sevenllm_qa_en"]
-
 - title: "CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge"
   description: |
     Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
@@ -123,6 +114,26 @@
   tasks: ["gdm_in_house_ctf"]
   tags: ["Agent"]

+- title: "SEvenLLM: A benchmark to elicit, and improve cybersecurity incident analysis and response abilities in LLMs for Security Events."
+  description: |
+    Designed for analyzing cybersecurity incidents, which is comprised of two primary task categories: understanding and generation, with a further breakdown into 28 subcategories of tasks.
+  path: src/inspect_evals/sevenllm
+  group: Cybersecurity
+  contributors: ["kingroryg"]
+  arxiv: https://arxiv.org/abs/2405.03446
+  tasks: ["sevenllm_mcq_zh", "sevenllm_mcq_en", "sevenllm_qa_zh", "sevenllm_qa_en"]
+
+- title: "SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security"
+  description: |
+    "Security Question Answering" dataset to assess LLMs' understanding and application of security principles.
+    SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria.
+    The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
+  path: src/inspect_evals/sec_qa
+  group: Cybersecurity
+  contributors: ["matthewreed26"]
+  arxiv: https://arxiv.org/abs/2312.15838
+  tasks: ["sec_qa_v1", "sec_qa_v1_5_shot", "sec_qa_v2", "sec_qa_v2_5_shot"]
+
 - title: "AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents"
   description: |
     Diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment.
@@ -400,4 +411,4 @@
     "agie_sat_en",
     "agie_sat_en_without_passage",
     "agie_sat_math",
-  ]
+  ]

0 commit comments