Commit a251298

Making evals multi-turn. Updating PI sys prompt (#20)
* Making evals multi-turn. Updating PI sys prompt
* Removed references to last_checked_index
* Change eval script invocation
* update PI docs
* adding unit test for this change
1 parent 293b1ae commit a251298

File tree: 17 files changed, +462 -258 lines changed

docs/evals.md

Lines changed: 16 additions & 6 deletions
@@ -4,16 +4,26 @@ Evaluate guardrail performance against labeled datasets with precision, recall,
 
 ## Quick Start
 
+### Invocation Options
+Install the project (e.g., `pip install -e .`) and run the CLI entry point:
+```bash
+guardrails-evals --help
+```
+During local development you can run the module directly:
+```bash
+python -m guardrails.evals.guardrail_evals --help
+```
+
 ### Basic Evaluation
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl
 ```
 
 ### Benchmark Mode
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path guardrails_config.json \
   --dataset-path data.jsonl \
   --mode benchmark \
@@ -154,15 +164,15 @@ The evaluation tool supports OpenAI, Azure OpenAI, and any OpenAI-compatible API
 
 ### OpenAI (Default)
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path config.json \
   --dataset-path data.jsonl \
   --api-key sk-...
 ```
 
 ### Azure OpenAI
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path config.json \
   --dataset-path data.jsonl \
   --azure-endpoint https://your-resource.openai.azure.com \
@@ -176,7 +186,7 @@ python guardrail_evals.py \
 Any model which supports the OpenAI interface can be used with `--base-url` and `--api-key`.
 
 ```bash
-python guardrail_evals.py \
+guardrails-evals \
   --config-path config.json \
   --dataset-path data.jsonl \
   --base-url http://localhost:11434/v1 \
@@ -198,4 +208,4 @@ python guardrail_evals.py \
 ## Next Steps
 
 - See the [API Reference](./ref/eval/guardrail_evals.md) for detailed documentation
-- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
+- Use [Wizard UI](https://guardrails.openai.com/) for configuring guardrails without code
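The `--dataset-path` argument above points at a JSONL file. As a quick sanity check before a long eval run, every non-empty line should parse as JSON; here is a minimal, hypothetical pre-flight helper (not part of the eval tool itself, and it does not validate the record schema):

```python
import json

def count_jsonl_records(path: str) -> int:
    """Count records in a JSONL dataset file, failing fast on malformed lines."""
    with open(path, encoding="utf-8") as f:
        # json.loads raises ValueError on the first line that is not valid JSON.
        records = [json.loads(line) for line in f if line.strip()]
    return len(records)
```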

docs/ref/checks/prompt_injection_detection.md

Lines changed: 20 additions & 14 deletions
@@ -67,8 +67,14 @@ Returns a `GuardrailResult` with the following `info` dictionary:
   "confidence": 0.1,
   "threshold": 0.7,
   "user_goal": "What's the weather in Tokyo?",
-  "action": "get_weather(location='Tokyo')",
-  "checked_text": "Original input text"
+  "action": [
+    {
+      "type": "function_call",
+      "name": "get_weather",
+      "arguments": "{'location': 'Tokyo'}"
+    }
+  ],
+  "checked_text": "[{'role': 'user', 'content': 'What is the weather in Tokyo?'}]"
 }
 ```
 
@@ -77,18 +83,18 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
-- **`action`**: The specific action being evaluated
-- **`checked_text`**: Original input text
+- **`action`**: The list of function calls or tool outputs analyzed for alignment
+- **`checked_text`**: Serialized conversation history inspected during analysis
 
 ## Benchmark Results
 
 ### Dataset Description
 
-This benchmark evaluates model performance on a synthetic dataset of agent conversation traces:
+This benchmark evaluates model performance on agent conversation traces:
 
-- **Dataset size**: 1,000 samples with 500 positive cases (50% prevalence)
-- **Data type**: Internal synthetic dataset simulating realistic agent traces
-- **Test scenarios**: Multi-turn conversations with function calls and tool outputs
+- **Synthetic dataset**: 1,000 samples with 500 positive cases (50% prevalence) simulating realistic agent traces
+- **AgentDojo dataset**: 1,046 samples from AgentDojo's workspace, travel, banking, and Slack suites combined with the "important_instructions" attack (949 positive cases, 97 negative samples)
+- **Test scenarios**: Multi-turn conversations with function calls and tool outputs across realistic workplace domains
 - **Misalignment examples**: Unrelated function calls, harmful operations, and data leakage
 
 **Example of misaligned conversation:**
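Since `action` is now a list of call objects rather than a single string, downstream code that logs or inspects it needs to iterate. A minimal sketch using the documented example values; the `info` dict is hard-coded here rather than read from a real `GuardrailResult`:

```python
# Hard-coded `info` payload mirroring the documented example; in practice
# this would come from the guardrail check's result object.
info = {
    "confidence": 0.1,
    "threshold": 0.7,
    "user_goal": "What's the weather in Tokyo?",
    "action": [
        {
            "type": "function_call",
            "name": "get_weather",
            "arguments": "{'location': 'Tokyo'}",
        }
    ],
    "checked_text": "[{'role': 'user', 'content': 'What is the weather in Tokyo?'}]",
}

# Flag only when confidence clears the configured threshold.
flagged = info["confidence"] >= info["threshold"]
# Collect the names of all function calls that were analyzed.
called = [a["name"] for a in info["action"] if a["type"] == "function_call"]
```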
@@ -107,12 +113,12 @@ This benchmark evaluates model performance on a synthetic dataset of agent conve
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |---------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-5 | 0.9997 | 1.000 | 1.000 | 1.000 | 0.998 |
-| gpt-5-mini | 0.9998 | 1.000 | 1.000 | 0.998 | 0.998 |
-| gpt-5-nano | 0.9987 | 0.996 | 0.996 | 0.996 | 0.996 |
-| gpt-4.1 | 0.9990 | 1.000 | 1.000 | 1.000 | 0.998 |
-| gpt-4.1-mini (default) | 0.9930 | 1.000 | 1.000 | 1.000 | 0.986 |
-| gpt-4.1-nano | 0.9431 | 0.982 | 0.845 | 0.695 | 0.000 |
+| gpt-5 | 0.9604 | 0.998 | 0.995 | 0.963 | 0.431 |
+| gpt-5-mini | 0.9796 | 0.999 | 0.999 | 0.966 | 0.000 |
+| gpt-5-nano | 0.8651 | 0.963 | 0.963 | 0.951 | 0.056 |
+| gpt-4.1 | 0.9846 | 0.998 | 0.998 | 0.998 | 0.000 |
+| gpt-4.1-mini (default) | 0.9728 | 0.995 | 0.995 | 0.995 | 0.000 |
+| gpt-4.1-nano | 0.8677 | 0.974 | 0.974 | 0.974 | 0.000 |
 
 **Notes:**
 
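Several models score 0.000 in the Recall@FPR=0.01 column. One plausible reading: with roughly 97 negatives (as in the AgentDojo split), a 1% false-positive budget admits zero false positives, so a single maximally confident negative pushes the operating threshold above every positive. A sketch of how such a metric can be computed from raw scores (an illustrative implementation, not the eval tool's own code):

```python
def recall_at_fpr(labels: list[int], scores: list[float], max_fpr: float = 0.01) -> float:
    """Recall at the strictest threshold whose false-positive rate stays within budget."""
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    allowed_fp = int(max_fpr * len(neg))  # how many negatives may be flagged
    # Threshold sits at the (allowed_fp+1)-th highest negative score, so only
    # `allowed_fp` negatives (with distinct scores) can exceed it.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    return sum(1 for s in pos if s > threshold) / len(pos)
```

With fewer than 100 negatives, `allowed_fp` is 0 and the threshold must clear the highest-scoring negative, which is why this metric is brittle on small negative sets.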
pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ packages = ["src/guardrails"]
 
 [project.scripts]
 guardrails = "guardrails.cli:main"
+guardrails-evals = "guardrails.evals.guardrail_evals:main"
 
 [tool.ruff]
 line-length = 150
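The added `[project.scripts]` entry is what makes the `guardrails-evals` command used throughout the docs available after installation: the installer generates a small wrapper that imports `guardrails.evals.guardrail_evals` and calls its `main`. The `module:function` lookup it performs is equivalent to:

```python
import importlib

def resolve_entry_point(spec: str):
    """Resolve a 'module:function' console-script spec to a callable."""
    module_name, _, attr = spec.partition(":")
    return getattr(importlib.import_module(module_name), attr)
```

For instance, `resolve_entry_point("json:dumps")` returns the standard library's `json.dumps`; the generated wrapper does the same for `guardrails.evals.guardrail_evals:main`.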

src/guardrails/agents.py

Lines changed: 0 additions & 8 deletions
@@ -166,14 +166,6 @@ class ToolConversationContext:
         def get_conversation_history(self) -> list:
             return self.conversation_history
 
-        def get_injection_last_checked_index(self) -> int:
-            """Return 0 to check all messages (required by prompt injection check)."""
-            return 0
-
-        def update_injection_last_checked_index(self, new_index: int) -> None:
-            """No-op (required by prompt injection check interface)."""
-            pass
-
     return ToolConversationContext(
         guardrail_llm=base_context.guardrail_llm,
         conversation_history=conversation_history,
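After this change the context class exposes only conversation access, since the prompt injection check no longer tracks a `last_checked_index`. A sketch of the reduced shape, assuming a dataclass-style container (the field types are guesses; the real definition lives in `src/guardrails/agents.py`):

```python
from dataclasses import dataclass, field
from typing import Any

# Sketch of the reduced context after removing the two
# last_checked_index shims; only conversation access remains.
@dataclass
class ToolConversationContext:
    guardrail_llm: Any
    conversation_history: list = field(default_factory=list)

    def get_conversation_history(self) -> list:
        return self.conversation_history
```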
