Commit a251298

Making evals multi-turn. Updating PI sys prompt (#20)
* Making evals multi-turn. Updating PI sys prompt
* Removed references to last_checked_index
* Change eval script invocation
* update PI docs
* adding unit test for this change
1 parent 293b1ae commit a251298

File tree: 17 files changed, +462 -258 lines changed

docs/evals.md

Lines changed: 16 additions & 6 deletions
@@ -4,16 +4,26 @@ Evaluate guardrail performance against labeled datasets with precision, recall,
 
 ## Quick Start
 
+### Invocation Options
+Install the project (e.g., `pip install -e .`) and run the CLI entry point:
+```bash
+guardrails-evals --help
+```
+During local development you can run the module directly:
+```bash
+python -m guardrails.evals.guardrail_evals --help
+```
+
 ### Basic Evaluation
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl
 ```
 
 ### Benchmark Mode
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl \
   --mode benchmark \
@@ -154,15 +164,15 @@ The evaluation tool supports OpenAI, Azure OpenAI, and any OpenAI-compatible API
 
 ### OpenAI (Default)
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path config.json \
   --dataset-path data.jsonl \
   --api-key sk-...
 ```
 
 ### Azure OpenAI
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path config.json \
   --dataset-path data.jsonl \
   --azure-endpoint https://your-resource.openai.azure.com \
@@ -176,7 +186,7 @@ python guardrail_evals.py \
 Any model which supports the OpenAI interface can be used with `--base-url` and `--api-key`.
 
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path config.json \
   --dataset-path data.jsonl \
   --base-url http://localhost:11434/v1 \
@@ -198,4 +208,4 @@ python guardrail_evals.py \
 ## Next Steps
 
 - See the [API Reference](./ref/eval/guardrail_evals.md) for detailed documentation
-- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
+- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
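The `--dataset-path` argument above points at a JSONL file. As a quick sanity check before a long eval run, every non-empty line should parse as JSON; here is a minimal, hypothetical pre-flight helper (not part of the eval tool itself, and it does not validate the record schema):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Count records in a JSONL dataset file, failing fast on malformed lines."""
    with open(path, encoding="utf-8") as f:
        # json.loads raises ValueError on the first line that is not valid JSON.
        records = [json.loads(line) for line in f if line.strip()]
    return len(records)
```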

docs/ref/checks/prompt_injection_detection.md

Lines changed: 20 additions & 14 deletions
@@ -67,8 +67,14 @@ Returns a `GuardrailResult` with the following `info` dictionary:
   "confidence": 0.1,
   "threshold": 0.7,
   "user_goal": "What's the weather in Tokyo?",
-  "action": "get_weather(location='Tokyo')",
-  "checked_text": "Original input text"
+  "action": [
+    {
+      "type": "function_call",
+      "name": "get_weather",
+      "arguments": "{'location': 'Tokyo'}"
+    }
+  ],
+  "checked_text": "[{'role': 'user', 'content': 'What is the weather in Tokyo?'}]"
 }
 ```
 
@@ -77,18 +83,18 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
-- **`action`**: The specific action being evaluated
-- **`checked_text`**: Original input text
+- **`action`**: The list of function calls or tool outputs analyzed for alignment
+- **`checked_text`**: Serialized conversation history inspected during analysis
 
 ## Benchmark Results
 
 ### Dataset Description
 
-This benchmark evaluates model performance on a synthetic dataset of agent conversation traces:
+This benchmark evaluates model performance on agent conversation traces:
 
-- **Dataset size**: 1,000 samples with 500 positive cases (50% prevalence)
-- **Data type**: Internal synthetic dataset simulating realistic agent traces
-- **Test scenarios**: Multi-turn conversations with function calls and tool outputs
+- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
+- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suites combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
+- **Test scenarios**: Multi-turn conversations with function calls and tool outputs across realistic workplace domains
 - **Misalignment examples**: Unrelated function calls, harmful operations, and data leakage
 
 **Example of misaligned conversation:**
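Since `action` is now a list of call objects rather than a single string, downstream code that logs or inspects it needs to iterate. A minimal sketch using the documented example values; the `info` dict is hard-coded here rather than read from a real `GuardrailResult`:

```python
# Hard-coded `info` payload mirroring the documented example; in practice
# this would come from the guardrail check's result object.
info = {
    "confidence": 0.1,
    "threshold": 0.7,
    "user_goal": "What's the weather in Tokyo?",
    "action": [
        {
            "type": "function_call",
            "name": "get_weather",
            "arguments": "{'location': 'Tokyo'}",
        }
    ],
    "checked_text": "[{'role': 'user', 'content': 'What is the weather in Tokyo?'}]",
}

# Flag only when confidence clears the configured threshold.
flagged = info["confidence"] >= info["threshold"]
# Collect the names of all function calls that were analyzed.
called = [a["name"] for a in info["action"] if a["type"] == "function_call"]
```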
@@ -107,12 +113,12 @@ This benchmark evaluates model performance on a synthetic dataset of agent conve
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9997 | 1.000 | 1.000 | 1.000 | 0.998 |
-| gpt-5-mini | 0.9998 | 1.000 | 1.000 | 0.998 | 0.998 |
-| gpt-5-nano | 0.9987 | 0.996 | 0.996 | 0.996 | 0.996 |
-| gpt-4.1 | 0.9990 | 1.000 | 1.000 | 1.000 | 0.998 |
-| gpt-4.1-mini (default) | 0.9930 | 1.000 | 1.000 | 1.000 | 0.986 |
-| gpt-4.1-nano | 0.9431 | 0.982 | 0.845 | 0.695 | 0.000 |
+| gpt-5 | 0.9604 | 0.998 | 0.995 | 0.963 | 0.431 |
+| gpt-5-mini | 0.9796 | 0.999 | 0.999 | 0.966 | 0.000 |
+| gpt-5-nano | 0.8651 | 0.963 | 0.963 | 0.951 | 0.056 |
+| gpt-4.1 | 0.9846 | 0.998 | 0.998 | 0.998 | 0.000 |
+| gpt-4.1-mini (default) | 0.9728 | 0.995 | 0.995 | 0.995 | 0.000 |
+| gpt-4.1-nano | 0.8677 | 0.974 | 0.974 | 0.974 | 0.000 |
 
 **Notes:**
 
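Several models score 0.000 in the Recall@FPR=0.01 column. One plausible reading: with roughly 97 negatives (as in the AgentDojo split), a 1% false-positive budget admits zero false positives, so a single maximally confident negative pushes the operating threshold above every positive. A sketch of how such a metric can be computed from raw scores (an illustrative implementation, not the eval tool's own code):

```python
def recall_at_fpr(labels: list[int], scores: list[float], max_fpr: float = 0.01) -> float:
    """Recall at the strictest threshold whose false-positive rate stays within budget."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(max_fpr * len(neg))  # how many negatives may be flagged
    # Threshold sits at the (allowed_fp+1)-th highest negative score, so only
    # `allowed_fp` negatives (with distinct scores) can exceed it.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    return sum(1 for s in pos if s > threshold) / len(pos)
```

With fewer than 100 negatives, `allowed_fp` is 0 and the threshold must clear the highest-scoring negative, which is why this metric is brittle on small negative sets.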
pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ packages = ["src/guardrails"]
 
 [project.scripts]
 guardrails = "guardrails.cli:main"
+guardrails-evals = "guardrails.evals.guardrail_evals:main"
 
 [tool.ruff]
 line-length = 150
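The added `[project.scripts]` entry is what makes the `guardrails-evals` command used throughout the docs available after installation: the installer generates a small wrapper that imports `guardrails.evals.guardrail_evals` and calls its `main`. The `module:function` lookup it performs is equivalent to:

```python
import importlib

def resolve_entry_point(spec: str):
    """Resolve a 'module:function' console-script spec to a callable."""
    module_name, _, attr = spec.partition(":")
    return getattr(importlib.import_module(module_name), attr)
```

For instance, `resolve_entry_point("json:dumps")` returns the standard library's `json.dumps`; the generated wrapper does the same for `guardrails.evals.guardrail_evals:main`.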

src/guardrails/agents.py

Lines changed: 0 additions & 8 deletions
@@ -166,14 +166,6 @@ class ToolConversationContext:
         def get_conversation_history(self) -> list:
             return self.conversation_history
 
-        def get_injection_last_checked_index(self) -> int:
-            """Return 0 to check all messages (required by prompt injection check)."""
-            return 0
-
-        def update_injection_last_checked_index(self, new_index: int) -> None:
-            """No-op (required by prompt injection check interface)."""
-            pass
-
     return ToolConversationContext(
         guardrail_llm=base_context.guardrail_llm,
         conversation_history=conversation_history,
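After this change the context class exposes only conversation access, since the prompt injection check no longer tracks a `last_checked_index`. A sketch of the reduced shape, assuming a dataclass-style container (the field types are guesses; the real definition lives in `src/guardrails/agents.py`):

```python
from dataclasses import dataclass, field
from typing import Any

# Sketch of the reduced context after removing the two
# last_checked_index shims; only conversation access remains.
@dataclass
class ToolConversationContext:
    guardrail_llm: Any
    conversation_history: list = field(default_factory=list)

    def get_conversation_history(self) -> list:
        return self.conversation_history
```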
