
Commit b938c4f

Merge pull request rllm-org#104 from kingroryg/main
SEvenLLM Benchmark Implementation | ASET - Arcadia Impact
2 parents cc3a9de + f8ed793 commit b938c4f

9 files changed: +580 -5 lines changed


README.md

Lines changed: 10 additions & 0 deletions

@@ -139,6 +139,16 @@ Inspect supports many model providers including OpenAI, Anthropic, Google, Mistr
  ```
  inspect eval inspect_evals/gdm_in_house_ctf
  ```
+
+ - ### [SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence](src/inspect_evals/sevenllm)
+ [SEvenLLM](https://arxiv.org/abs/2405.03446) is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.
+ <sub><sup>Contributed by: [@kingroryg](https://github.com/kingroryg)</sup></sub>
+ ```
+ inspect eval inspect_evals/sevenllm_mcq_zh
+ inspect eval inspect_evals/sevenllm_mcq_en
+ inspect eval inspect_evals/sevenllm_qa_zh
+ inspect eval inspect_evals/sevenllm_qa_en
+ ```

  - ### [SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security](src/inspect_evals/sec_qa)
  LLMs should be capable of understanding computer security principles and applying best practices to their responses. SecQA has “v1” and “v2” datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions are generated by GPT-4 based on the “Computer Systems Security: Planning for Success” textbook and vetted by humans.

requirements.txt

Lines changed: 1 addition & 3 deletions

@@ -1,6 +1,4 @@
  datasets>=2.21
  pillow
  requests
- tiktoken
-
-
+ tiktoken

src/inspect_evals/_registry.py

Lines changed: 1 addition & 0 deletions

@@ -57,6 +57,7 @@
  from .pubmedqa import pubmedqa
  from .race_h import race_h
  from .sec_qa import sec_qa_v1, sec_qa_v1_5_shot, sec_qa_v2, sec_qa_v2_5_shot
+ from .sevenllm import sevenllm_mcq_en, sevenllm_mcq_zh, sevenllm_qa_en, sevenllm_qa_zh
  from .squad import squad
  from .swe_bench import swe_bench
  from .truthfulqa import truthfulqa
src/inspect_evals/sevenllm/README.md

Lines changed: 71 additions & 0 deletions

@@ -0,0 +1,71 @@
# SEvenLLM

[SEvenLLM](https://arxiv.org/abs/2405.03446) is a benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.

<!-- Contributors: Automatically Generated -->
Contributed by [@kingroryg](https://github.com/kingroryg)
<!-- /Contributors: Automatically Generated -->

<!-- Usage: Automatically Generated -->
## Usage

First, install the `inspect_ai` and `inspect_evals` Python packages with:

```bash
pip install inspect_ai
pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
pip install jieba sentence_transformers rouge  # packages required for the sevenllm benchmark
```

Then, evaluate against one or more models with:

```bash
inspect eval inspect_evals/sevenllm_mcq_zh --model openai/gpt-4o
inspect eval inspect_evals/sevenllm_mcq_en --model openai/gpt-4o
inspect eval inspect_evals/sevenllm_qa_zh --model openai/gpt-4o
inspect eval inspect_evals/sevenllm_qa_en --model openai/gpt-4o
```

After running evaluations, you can view their logs using the `inspect view` command:

```bash
inspect view
```

If you don't want to specify the `--model` each time you run an evaluation, create a `.env` configuration file in your working directory that defines the `INSPECT_EVAL_MODEL` environment variable along with your API key. For example:

```bash
INSPECT_EVAL_MODEL=anthropic/claude-3-5-sonnet-20240620
ANTHROPIC_API_KEY=<anthropic-api-key>
```
<!-- /Usage: Automatically Generated -->

## Dataset

The dataset described in the paper is a bilingual (English and Chinese) instruction dataset created specifically to train LLMs for cybersecurity tasks. It consists of 28 well-defined tasks related to cybersecurity incident analysis, split into two categories:

1. Understanding tasks (e.g., key entity recognition, malware feature extraction, vulnerability intelligence extraction)
2. Generation tasks (e.g., attack strategy analysis, trend prediction, summary generation)

The benchmark contains 1,300 test samples with multiple-choice and query-answer questions. For example:

> A phishing campaign targeted employees at a major company using fake login portals. The attacker used the domain phishing[.]com and sent emails requesting password updates. This attack exploited a vulnerability in the email server to bypass spam filters.
>
> Task: Key Entity Recognition
>
> Instruction: Identify key entities such as attacker, victim, attack tools, and indicators of compromise from the text.
>
> Output:
>
> Attacker Information: Unknown (phishing campaign organizer)
>
> Victim: Employees of a major company
>
> Domain Name: phishing[.]com
>
> Vulnerability Exploited: Email server spam filter bypass
>
> Attack Tool: Fake login portals

## Scoring

The QA (generation) evaluations report two key metrics: ROUGE-L, which measures the overlap between the model-generated and reference responses, and semantic similarity, which uses a multilingual sentence transformer to assess how closely the meaning of the output aligns with the reference. The MCQ (understanding) evaluations are scored on answer accuracy.
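As a rough, self-contained illustration of what these two metrics measure, here is a minimal sketch (not part of this commit) that computes them for a single answer/reference pair with the same `rouge` and `sentence_transformers` packages the benchmark installs; the example texts are invented.

```python
# Sketch: compute ROUGE-L F-measure and embedding cosine similarity for one pair.
from rouge import Rouge
from sentence_transformers import SentenceTransformer, util

reference = "The attacker sent phishing emails from phishing[.]com to steal credentials."
candidate = "Phishing emails from phishing[.]com were used by the attacker to harvest credentials."

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
rouge_l_f = Rouge().get_scores(candidate, reference)[0]["rouge-l"]["f"]

# Semantic similarity: cosine similarity of multilingual sentence embeddings.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode([candidate, reference], convert_to_tensor=True)
cosine = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()

print(f"ROUGE-L F: {rouge_l_f:.3f}, semantic similarity: {cosine:.3f}")
```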
src/inspect_evals/sevenllm/__init__.py

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
from .sevenllm import (
    sevenllm_mcq_en,
    sevenllm_mcq_zh,
    sevenllm_qa_en,
    sevenllm_qa_zh,
)

__all__ = ["sevenllm_mcq_en", "sevenllm_mcq_zh", "sevenllm_qa_en", "sevenllm_qa_zh"]
src/inspect_evals/sevenllm/scorers.py

Lines changed: 129 additions & 0 deletions

@@ -0,0 +1,129 @@
import re

import jieba
from inspect_ai.scorer import Score, Scorer, Target, mean, scorer, stderr
from inspect_ai.solver import TaskState
from rouge import Rouge
from sentence_transformers import SentenceTransformer, util

SENTENCE_TRANSFORMER_REPO = (
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
ROUGE_L_THRESHOLD = 0.2


def split_sentences(text: str, is_zh: bool) -> list[str]:
    """Split text into sentences (for ROUGE-L scoring)."""
    if is_zh:
        sentences = re.findall(r"[^。!?]+[。!?]?", text)
    else:
        sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!)\s", text)
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences


def tokenize_sentence(sentence: str, is_zh: bool) -> str:
    """Tokenize sentences (for ROUGE-L scoring)."""
    return " ".join(jieba.cut(sentence)) if is_zh else sentence


@scorer(metrics=[mean(), stderr()])
def rouge_l_scorer(is_zh: bool) -> Scorer:
    """ROUGE-L scorer for the SEVENLLM benchmark.

    This scorer calculates the ROUGE-L score to evaluate the similarity between
    the model-generated output and the reference text on a per-sentence basis.
    The ROUGE-L metric focuses on the longest common subsequence (LCS) between
    the sentences, which captures the structure and fluency of the text. The scorer
    is designed to handle both English and Chinese texts, with sentence tokenization
    and splitting adjusted based on the `is_zh` flag.

    The routine:
    - Splits the target reference and model-generated text into sentences.
    - Tokenizes each sentence (with support for Chinese tokenization).
    - Computes the ROUGE-L similarity score for each pair of sentences.
    - Aggregates the scores, with sentences scoring above a defined threshold
      (default: 0.2) contributing to the final score proportionally.

    The total score reflects the percentage of sentences in the reference text
    that have sufficiently similar counterparts in the generated output,
    scaled to 100%.

    Args:
        is_zh (bool): A flag indicating whether the input text is in Chinese (`True`)
            or English (`False`).

    Returns:
        Scorer: A scoring function that computes the aggregated ROUGE-L score
            and provides the result alongside the model-generated output and a
            detailed explanation.
    """
    rouge = Rouge()

    async def score(state: TaskState, target: Target) -> Score:
        test_sentences = split_sentences(target.text, is_zh)
        generated_sentences = split_sentences(state.output.completion, is_zh)
        num_sentences = len(test_sentences)
        per_sentence_score = 100.0 / num_sentences  # Adjust scaling as needed
        total_score = 0.0

        for test_sentence in test_sentences:
            max_similarity = 0.0
            for gen_sentence in generated_sentences:
                test_sentence_tok = tokenize_sentence(test_sentence, is_zh)
                gen_sentence_tok = tokenize_sentence(gen_sentence, is_zh)

                scores = rouge.get_scores(test_sentence_tok, gen_sentence_tok)
                similarity = scores[0]["rouge-l"]["f"]
                if similarity > max_similarity:
                    max_similarity = similarity
            if max_similarity >= ROUGE_L_THRESHOLD:
                total_score += per_sentence_score

        return Score(
            value=total_score,
            answer=state.output.completion,
            explanation=f"Accumulated ROUGE-L score: {total_score:.4f}",
        )

    return score


@scorer(metrics=[mean(), stderr()])
def semantic_similarity_scorer() -> Scorer:
    """Semantic similarity scorer for the SEVENLLM benchmark.

    This scorer evaluates the semantic similarity between the model-generated output
    and the reference target text using a pre-trained SentenceTransformer model.
    It computes cosine similarity between the embeddings of the two texts, providing
    a quantitative measure of their alignment in meaning.

    The routine:
    - Encodes both the model's output and the reference target text into embeddings
      using a SentenceTransformer.
    - Computes the cosine similarity between the two embeddings as the similarity score.
    - Returns the score along with the model output and an explanation string.

    The score is aggregated using mean and standard error (stderr) metrics to evaluate
    model performance across the benchmark.

    Returns:
        Scorer: A scoring function that computes semantic similarity and provides
            both the similarity score and a detailed explanation of the result.
    """
    model = SentenceTransformer(SENTENCE_TRANSFORMER_REPO)

    async def score(state: TaskState, target: Target) -> Score:
        model_output = state.output.completion
        target_text = target.text
        embeddings = model.encode([model_output, target_text], convert_to_tensor=True)
        cosine_similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
        similarity_score = cosine_similarity.item()

        return Score(
            value=similarity_score,
            answer=model_output,
            explanation=f"Semantic similarity score: {similarity_score:.4f}",
        )

    return score
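For intuition about the thresholded, per-sentence aggregation that `rouge_l_scorer` performs, the following standalone sketch (not part of the committed module; the reference and generated sentences are invented) reproduces the same logic without the Inspect scaffolding.

```python
# Sketch of the per-sentence ROUGE-L aggregation used by rouge_l_scorer.
from rouge import Rouge

ROUGE_L_THRESHOLD = 0.2  # same threshold as scorers.py

reference_sentences = [
    "The attacker registered the domain phishing[.]com.",
    "Data was exfiltrated over DNS tunnelling.",
]
generated_sentences = [
    "The adversary registered phishing[.]com as a lure domain.",
    "No exfiltration channel was identified.",
]

rouge = Rouge()
per_sentence_score = 100.0 / len(reference_sentences)
total_score = 0.0

for ref in reference_sentences:
    # Best ROUGE-L F-measure of this reference sentence against any generated sentence.
    best = max(
        rouge.get_scores(ref, gen)[0]["rouge-l"]["f"] for gen in generated_sentences
    )
    if best >= ROUGE_L_THRESHOLD:
        total_score += per_sentence_score

# total_score is the percentage of reference sentences that have a sufficiently
# similar counterpart in the generated text (0.0, 50.0, or 100.0 here).
print(f"Accumulated ROUGE-L score: {total_score:.1f}")
```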
src/inspect_evals/sevenllm/sevenllm.py

Lines changed: 140 additions & 0 deletions

@@ -0,0 +1,140 @@
"""SEvenLLM: Benchmarking, Eliciting, and Enhancing Abilities of Large Language Models in Cyber Threat Intelligence

Hangyuan Ji, Jian Yang, Linzheng Chai, Chaoren Wei, Liqun Yang,
Yunlong Duan, Yunli Wang, Tianzhen Sun, Hongcheng Guo, Tongliang Li,
Changyu Ren, Zhoujun Li

https://arxiv.org/abs/2405.03446

# Eval for MCQs (Understanding) in Simplified Chinese (Zh)
inspect eval inspect_evals/sevenllm_mcq_zh

# Eval for MCQs (Understanding) in English (En)
inspect eval inspect_evals/sevenllm_mcq_en

# Eval for QA (Generation) in Simplified Chinese (Zh)
inspect eval inspect_evals/sevenllm_qa_zh

# Eval for QA (Generation) in English (En)
inspect eval inspect_evals/sevenllm_qa_en
"""

import json
import re
from typing import Any

from inspect_ai import Task, task
from inspect_ai.dataset import Dataset, Sample, json_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import generate, multiple_choice, prompt_template

from inspect_evals.sevenllm.scorers import rouge_l_scorer, semantic_similarity_scorer

BENCHMARK_DATASET_URL = "https://huggingface.co/datasets/Multilingual-Multimodal-NLP/SEVENLLM-Dataset/raw/main/test.jsonl"
# This prompt is adapted from the inference work done by the authors.
# See: https://github.com/CSJianYang/SEevenLLM/blob/main/infer.py
# The prompt is the same for both languages.
TEMPLATE = r"""
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request. If you are given options (A, B, C, D),
return just the option letter and not the entire text. If you're required to return a JSON, only return a JSON and
nothing else.

{prompt}
"""


@task
def sevenllm_mcq_zh() -> Task:
    """Inspect task implementing the SEvenLLM benchmark for MCQs in Simplified Chinese."""
    return Task(
        dataset=get_sevenllm_dataset(language="zh", data_format="mcq"),
        solver=[prompt_template(template=TEMPLATE), multiple_choice()],
        scorer=choice(),
    )


@task
def sevenllm_mcq_en() -> Task:
    """Inspect task implementing the SEvenLLM benchmark for MCQs in English."""
    return Task(
        dataset=get_sevenllm_dataset(language="en", data_format="mcq"),
        solver=[prompt_template(template=TEMPLATE), multiple_choice()],
        scorer=choice(),
    )


@task
def sevenllm_qa_zh() -> Task:
    """Inspect task implementing the SEvenLLM benchmark for QA in Simplified Chinese."""
    return Task(
        dataset=get_sevenllm_dataset(language="zh", data_format="qa"),
        solver=[prompt_template(template=TEMPLATE), generate()],
        scorer=[rouge_l_scorer(is_zh=True), semantic_similarity_scorer()],
    )


@task
def sevenllm_qa_en() -> Task:
    """Inspect task implementing the SEvenLLM benchmark for QA in English."""
    return Task(
        dataset=get_sevenllm_dataset(language="en", data_format="qa"),
        solver=[prompt_template(template=TEMPLATE), generate()],
        scorer=[rouge_l_scorer(is_zh=False), semantic_similarity_scorer()],
    )


def contains_zsh(text: str) -> bool:
    """Return True if the text contains a Simplified Chinese character."""
    # Regular expression to match Simplified Chinese characters
    # CJK Unified Ideographs range: \u4e00-\u9fff
    pattern = re.compile(r"[\u4e00-\u9fff]")

    return bool(pattern.search(text))


def record_to_sample(record: dict[str, Any]) -> Sample:
    """Applies transformations to each record in the dataset for the Task."""
    instruction = record["instruction"]
    record_format = "qa" if isinstance(instruction, str) else "mcq"
    text = instruction if isinstance(instruction, str) else instruction["question"]
    record_language = "zh" if contains_zsh(text) else "en"

    sample = {
        "id": record["id"],
        "input": f"{text}\n\n{record['input']}\n\n",
        "metadata": {
            "category": record["category"],
            "cot": record["thought"],
            "language": record_language,
            "format": record_format,
        },
        "target": json.dumps(record["output"], ensure_ascii=False)
        if record_format == "qa"
        else str(record["output"]),
    }

    if record_format == "mcq":
        sample["choices"] = [
            str(instruction["choice"]["A"]),
            str(instruction["choice"]["B"]),
            str(instruction["choice"]["C"]),
            str(instruction["choice"]["D"]),
        ]

    return Sample(**sample)


def get_sevenllm_dataset(language: str, data_format: str) -> Dataset:
    """Get a filtered dataset from the SEvenLLM benchmark based on language and format."""
    # Cannot use `hf_dataset` here because the benchmark jsonl file contains json of
    # multiple schemas, which violates PyArrow validation checks.
    dataset = json_dataset(
        json_file=BENCHMARK_DATASET_URL,
        sample_fields=record_to_sample,
    )

    return dataset.filter(
        lambda sample: sample.metadata["format"] == data_format
        and sample.metadata["language"] == language
    )
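Besides the `inspect eval` CLI commands listed in the module docstring, the registered tasks can also be run programmatically. A minimal sketch using `inspect_ai`'s `eval()` function follows; the model name and sample limit are illustrative placeholders, not part of this commit.

```python
# Illustrative only: run one of the tasks defined above through the
# inspect_ai Python API instead of the `inspect eval` CLI.
from inspect_ai import eval

from inspect_evals.sevenllm import sevenllm_qa_en

logs = eval(
    sevenllm_qa_en(),
    model="openai/gpt-4o",  # placeholder model
    limit=10,  # evaluate only the first 10 samples while experimenting
)
```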

0 commit comments
