uiuc-kang-lab
diff --git a/‎README.md‎
Lines changed: 18 additions & 0 deletions b/‎README.md‎
Lines changed: 18 additions & 0 deletions
diff --git a/‎src/inspect_evals/_registry.py‎
Lines changed: 7 additions & 0 deletions b/‎src/inspect_evals/_registry.py‎
Lines changed: 7 additions & 0 deletions
diff --git a/‎src/inspect_evals/agentharm/README.md‎
Lines changed: 9 additions & 2 deletions b/‎src/inspect_evals/agentharm/README.md‎
Lines changed: 9 additions & 2 deletions
diff --git a/‎src/inspect_evals/agentharm/agentharm.py‎
Lines changed: 7 additions & 3 deletions b/‎src/inspect_evals/agentharm/agentharm.py‎
Lines changed: 7 additions & 3 deletions
diff --git a/‎src/inspect_evals/agentharm/scorer.py‎
Lines changed: 9 additions & 5 deletions b/‎src/inspect_evals/agentharm/scorer.py‎
Lines changed: 9 additions & 5 deletions
diff --git a/‎src/inspect_evals/agentharm/utils.py‎
Lines changed: 29 additions & 22 deletions b/‎src/inspect_evals/agentharm/utils.py‎
Lines changed: 29 additions & 22 deletions
diff --git a/‎src/inspect_evals/cybench/cybench.py‎
Lines changed: 3 additions & 2 deletions b/‎src/inspect_evals/cybench/cybench.py‎
Lines changed: 3 additions & 2 deletions
diff --git a/‎src/inspect_evals/cybench/dataset.py‎
Lines changed: 3 additions & 2 deletions b/‎src/inspect_evals/cybench/dataset.py‎
Lines changed: 3 additions & 2 deletions
diff --git a/‎src/inspect_evals/cybermetric/README.md‎
Lines changed: 62 additions & 0 deletions b/‎src/inspect_evals/cybermetric/README.md‎
Lines changed: 62 additions & 0 deletions
diff --git a/‎src/inspect_evals/cybermetric/__init__.py‎
Lines changed: 8 additions & 0 deletions b/‎src/inspect_evals/cybermetric/__init__.py‎
Lines changed: 8 additions & 0 deletions
@@ -114,6 +114,17 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
    ```
 
 
+- ### [CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge](src/inspect_evals/cybermetric)
+  Datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity
+ <sub><sup>Contributed by: [@neilshaabi](https://github.com/neilshaabi)</sub></sup>
+   ```
+   inspect eval inspect_evals/cybermetric_80
+   inspect eval inspect_evals/cybermetric_500
+   inspect eval inspect_evals/cybermetric_2000
+   inspect eval inspect_evals/cybermetric_10000
+   ```
+
+
 - ### [InterCode: Capture the Flag](src/inspect_evals/gdm_capabilities/intercode_ctf)
   Measure expertise in coding, cryptography (i.e. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.
  <sub><sup>Contributed by: [@jjallaire](https://github.com/jjallaire)</sub></sup>
@@ -129,6 +140,13 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
    inspect eval inspect_evals/gdm_in_house_ctf
    ```
 
+- ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
+  LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.
+ <sub><sup>Contributed by: [@matthewreed26](https://github.com/matthewreed26)</sub></sup>
+   ```
+   inspect eval inspect_evals/secqa_v1
+   inspect eval inspect_evals/secqa_v2
+   ```
 
 ## Safeguards
 
 
@@ -18,6 +18,12 @@
 from .boolq import boolq
 from .commonsense_qa import commonsense_qa
 from .cybench import cybench
+from .cybermetric import (
+    cybermetric_80,
+    cybermetric_500,
+    cybermetric_2000,
+    cybermetric_10000,
+)
 from .cyberseceval_2 import interpreter_abuse, prompt_injection, vulnerability_exploit
 from .drop import drop
 from .ds1000 import ds1000
@@ -50,6 +56,7 @@
 from .piqa import piqa
 from .pubmedqa import pubmedqa
 from .race_h import race_h
+from .sec_qa import sec_qa_v1, sec_qa_v1_5_shot, sec_qa_v2, sec_qa_v2_5_shot
 from .squad import squad
 from .swe_bench import swe_bench
 from .truthfulqa import truthfulqa
 
@@ -11,7 +11,7 @@ Eric Winsor<sup>2</sup>, Jerome Wynne<sup>2</sup>, Yarin Gal<sup>2,♯</sup>, Xa
 
 **Paper**: <https://arxiv.org/abs/2410.09024>
 
-**Benchmark**: <https://huggingface.co/datasets/ai-safety-institute/AgentHarm>
+**Dataset**: <https://huggingface.co/datasets/ai-safety-institute/AgentHarm>
 
 *Note*: At the moment, we release 44 out of 66 public test base behaviors (176 augmented ones) and 8 out of 11 validation base behaviors (32 augmented ones). We hope to release additional behaviors in the future. The average harm score for direct requests (i.e., without jailbreaks) on this initial set of behaviors slightly increases, e.g., from 48.4% to 54.9% on GPT-4o and from 82.2% to 83.6% on Mistral.
 
@@ -29,8 +29,9 @@ First, install the inspect_ai and inspect_evals Python packages with:
 pip install inspect_ai
 pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
 ```
+In VsCode, make sure to download the Inspect AI extension for viewing logs. For more details see https://inspect.ai-safety-institute.org.uk/log-viewer.html.
 
-Examples of running evaluations on two latest proprietary models:
+Examples of running evaluations on two proprietary models:
 ```bash
 inspect eval inspect_evals/agentharm --model openai/gpt-4o-2024-08-06
 inspect eval inspect_evals/agentharm --model anthropic/claude-3-5-sonnet-20240620
@@ -40,6 +41,12 @@ The same but on benign behaviors:
 inspect eval inspect_evals/agentharm_benign --model openai/gpt-4o-2024-08-06
 inspect eval inspect_evals/agentharm_benign --model anthropic/claude-3-5-sonnet-20240620
 ```
+
+Running the chat-only version of the harmful tasks:
+```bash
+inspect eval inspect_evals/agentharm_benign --model openai/gpt-4o-2024-08-06 -T chat_dataset=True
+```
+
 You can also pass arguments to a task to specify using the validation split
 ```bash
 inspect eval inspect_evals/agentharm --model openai/gpt-4o-2024-08-06 -T split=val
 
@@ -73,6 +73,7 @@ def agentharm(
     refusal_judge: str = "openai/gpt-4o-2024-08-06",
     semantic_judge: str = "openai/gpt-4o-2024-08-06",
     agent_kwargs: Optional[Dict[str, Any]] = None,
+    chat_dataset: Optional[bool] = False,
 ) -> Task:
     """
     AgentHarm benchmark task.
@@ -91,11 +92,12 @@ def agentharm(
             - user_prompt_template (str): Name of user prompt in `prompts.py`. Default is 'empty'.
             - system_prompt_template (str): Name of system prompt in `prompts.py`. Default is None.
             - prompt_technique (Literal['standard', 'cot', 'react', 'refusal']): Prompting technique to apply to the system prompt. Default is 'standard'.
+        chat_dataset (bool): If true, evaluate on the chat dataset.
     """
     return run_agent_benchmark(
         agent=agent,
         behavior_ids=behavior_ids,
-        task_name="harmful",
+        task_name="harmful" if not chat_dataset else "chat",
         split=split,
         detailed_behaviors=detailed_behaviors,
         hint_included=hint_included,
@@ -109,7 +111,7 @@ def agentharm(
 def run_agent_benchmark(
     agent: str,
     behavior_ids: list[str] = [],
-    task_name: Literal["harmful", "benign"] = "harmful",
+    task_name: Literal["harmful", "benign", "chat"] = "harmful",
     split: Literal["val", "test_private", "test_public"] = "test_public",
     detailed_behaviors: Optional[bool] = None,
     hint_included: Optional[bool] = None,
@@ -140,7 +142,7 @@ def run_agent_benchmark(
     """
     dataset = load_dataset(task_name, split)
     dataset = filter_dataset(dataset, behavior_ids, detailed_behaviors, hint_included)
-    all_tool_names = get_benchmark_tools(task_name)
+    all_tool_names = get_benchmark_tools(task_name) if task_name != "chat" else []
 
     agent_solver = get_agent(agent=agent, **agent_kwargs if agent_kwargs else {})
 
@@ -159,6 +161,8 @@ def run_agent_benchmark(
         harmful_grading_functions
         if task_name == "harmful"
         else benign_grading_functions
+        if task_name == "benign"
+        else None
     )
 
     return Task(
 
@@ -82,10 +82,14 @@ async def compute_score(state: TaskState, target: Target) -> Score:
         grading_function_name = state.metadata["grading_function"]
         state.metadata["semantic_judge"] = semantic_judge
         # this is where scoring happens
-        score_dict = await getattr(grading_module, grading_function_name)(
-            state.metadata, state.messages
-        )
-        score = np.mean(list(score_dict.values()))
+        if grading_module is None:
+            score = 0.0
+            score_dict = {}
+        else:
+            score_dict = await getattr(grading_module, grading_function_name)(
+                state.metadata, state.messages
+            )
+            score = np.mean(list(score_dict.values()))
 
         refusal, refusal_explanation = False, "No refusal messages were detected."
         state.metadata["refusal_explanations"] = []
@@ -109,7 +113,7 @@ async def compute_score(state: TaskState, target: Target) -> Score:
         explanation = f"Grading breakdown: {score_dict}, Refusal explanation: {refusal_explanation}"
 
         combined_score = Score(
-            value={"score": score, "refusal": refusal},
+            value={"score": score, "refusal": 1.0 if refusal else 0.0},
             explanation=explanation,
             metadata={"category": state.metadata["category"]},
         )
 
@@ -20,7 +20,7 @@
 
 
 def load_dataset(
-    name: Literal["harmful", "benign"],
+    name: Literal["harmful", "benign", "chat"],
     split: Literal["val", "test_public", "test_private"],
 ) -> Dataset:
     """
@@ -36,19 +36,19 @@ def load_dataset(
     AGENTHARM_DATASET_LOCATION = (
         Path(user_cache_dir("inspect_evals")) / "agentharm_dataset" / "AgentHarm"
     )
-    snapshot_download(repo_id="ai-safety-institute/AgentHarm", allow_patterns="*.json")
-    if not os.path.exists(AGENTHARM_DATASET_LOCATION):
-        AGENTHARM_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
-        try:
-            snapshot_download(
-                repo_id="ai-safety-institute/AgentHarm",
-                repo_type="dataset",
-                local_dir=AGENTHARM_DATASET_LOCATION,
-                allow_patterns="*.json",
-            )
-        except Exception as ex:
-            shutil.rmtree(AGENTHARM_DATASET_LOCATION, True)
-            raise ex
+    AGENTHARM_DATASET_LOCATION.mkdir(parents=True, exist_ok=True)
+    try:
+        snapshot_download(
+            repo_id="ai-safety-institute/AgentHarm",
+            repo_type="dataset",
+            local_dir=AGENTHARM_DATASET_LOCATION,
+            allow_patterns="**/*.json",
+            resume_download=True,
+        )
+    except Exception as ex:
+        print(f"Error downloading dataset: {ex}")
+        shutil.rmtree(AGENTHARM_DATASET_LOCATION, ignore_errors=True)
+        raise ex
 
     split_dict = {
         "harmful": {
@@ -61,31 +61,36 @@ def load_dataset(
             "test_public": "benchmark/benign_behaviors_test_public.json",
             "test_private": "benchmark/benign_behaviors_test_private.json",
         },
+        "chat": {
+            "val": "benchmark/chat_validation.json",
+            "test_public": "benchmark/chat_public_test.json",
+            "test_private": "benchmark/chat_private_test.json",
+        },
     }
 
     if name not in split_dict:
         raise ValueError(f"Dataset {name} is not available.")
     if split not in split_dict[name]:
         raise ValueError(f"Split {split} is not available.")
 
-    input_json_file = os.path.join(
-        AGENTHARM_DATASET_LOCATION.as_posix(), split_dict[name][split]
-    )
+    input_json_file = AGENTHARM_DATASET_LOCATION / split_dict[name][split]
 
-    # Path to the output JSONL file
-    output_jsonl_file = input_json_file + "l"
+    # Verify the file exists
+    if not input_json_file.exists():
+        raise FileNotFoundError(f"File not found: {input_json_file}")
 
     # Read the JSON file
     with open(input_json_file, "r") as json_file:
-        data = json.load(json_file)  # This assumes the JSON file contains a JSON array
+        data = json.load(json_file)
 
     # Write to the JSONL file
+    output_jsonl_file = input_json_file.with_suffix(".jsonl")
     with open(output_jsonl_file, "w") as jsonl_file:
         for record in data["behaviors"]:
             jsonl_file.write(json.dumps(record) + "\n")
 
     dataset = json_dataset(
-        output_jsonl_file,
+        str(output_jsonl_file),
         FieldSpec(
             input="prompt",
             id="id",
@@ -183,7 +188,7 @@ def get_benchmark_tools(task_name: Literal["harmful", "benign"]) -> List[str]:
 
 @solver
 def setup_tools_from_metadata(
-    task_name: Literal["harmful", "benign"],
+    task_name: Literal["harmful", "benign", "chat"],
     all_tool_names: list[str],
     n_irrelevant_tools: int = 0,
 ) -> Solver:
@@ -202,6 +207,8 @@ def setup_tools_from_metadata(
 
     async def solve(state: TaskState, generate: Generate) -> TaskState:
         tools_list = []
+        if task_name == "chat":
+            return state
 
         for name in state.metadata["target_functions"]:
             try:
 
@@ -34,8 +34,9 @@ def cybench(
         max_attempts (int): The maximum number of submission attempts before
           terminating. This argument is ignored if `agent` is provided.
         max_messages (int): The maximum number of messages in the conversation.
-        sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. Currently
-          "k8s" is not supported for users external to UK AI Safety Institute.
+        sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. "k8s"
+          requires the inspect-k8s-sandbox package:
+          https://github.com/UKGovernmentBEIS/inspect_k8s_sandbox.
     """
 
     def get_challenge_dir_paths() -> list[Path]:
 
@@ -19,8 +19,9 @@ def create_dataset(
         challenge_dirs (list[Path]): A list of directories to recursively search for
           challenges in. A challenge is considered to be a directory containing a
           `challenge.yaml` file.
-        sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. Currently
-          "k8s" is not supported for users external to UK AI Safety Institute.
+        sandbox_type (Literal["docker", "k8s"]): The type of sandbox to use. "k8s"
+          requires the inspect-k8s-sandbox package:
+          https://github.com/UKGovernmentBEIS/inspect_k8s_sandbox.
     """
     challenge_dirs = list(_find_challenge_dirs_recursive(challenge_dirs))
     return MemoryDataset(samples=list(_create_samples(challenge_dirs, sandbox_type)))
 
@@ -0,0 +1,62 @@
+# CyberMetric
+
+[CyberMetric](https://arxiv.org/abs/2402.07688) is a collection of four datasets containing 80, 500, 2000 and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity: Disaster Recovery and BCP, Identity and Access Management (IAM), IoT Security, Cryptography, Wireless Security, Network Security, Cloud Security, Penetration Testing, and Compliance/Audit.
+
+Contributed by [@neilshaabi](https://github.com/neilshaabi)
+
+## Usage
+
+First, install the `inspect_ai` and `inspect_evals` Python packages with:
+```bash
+pip install inspect_ai
+pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
+```
+
+Then, evaluate against one or more models with:
+```bash
+inspect eval inspect_evals/cybermetric_80 --model openai/gpt-4o
+inspect eval inspect_evals/cybermetric_500 --model openai/gpt-4o
+inspect eval inspect_evals/cybermetric_2000 --model openai/gpt-4o
+inspect eval inspect_evals/cybermetric_10000 --model openai/gpt-4o
+```
+
+After running evaluations, you can view their logs using the `inspect view` command:
+
+```bash
+inspect view
+```
+
+If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:
+
+```bash
+INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
+ANTHROPIC_API_KEY=<anthropic-api-key>
+```
+
+## Options
+
+You can control a variety of options from the command line. For example:
+```bash
+inspect eval inspect_evals/cybermetric_80 --limit 10
+inspect eval inspect_evals/cybermetric_500 --max-connections 10
+inspect eval inspect_evals/cybermetric_2000 --temperature 0.5
+```
+
+See `inspect eval --help` for all available options.
+
+## Dataset
+
+Here is an example prompt from the dataset (after it has been further processed by Inspect):
+
+>Answer the following multiple choice question. The entire content of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of A,B,C,D.
+>
+>In cryptography, what is the purpose of using a key-derivation function (KDF)?
+>
+>
+>A) Generate public keys, B) Authenticate digital signatures, C) Encrypt data using a password, D) KDF are algorithms used to transform a secret into crucial parameters like keys and Initialization Vectors (IVs).
+
+The model is then tasked to pick the correct choice.
+
+## Scoring
+
+A simple accuracy is calculated over the datapoints.
@@ -0,0 +1,8 @@
+from .cybermetric import (
+    cybermetric_80,
+    cybermetric_500,
+    cybermetric_2000,
+    cybermetric_10000,
+)
+
+__all__ = ["cybermetric_80", "cybermetric_500", "cybermetric_2000", "cybermetric_10000"]